Gradient and Hessian of log-likelihood for multinomial logistic regression

The gradient and Hessian of the log-likelihood function for multinomial logistic regression are the key quantities needed for gradient-based and Newton-type parameter estimation. Here’s how they are derived and expressed:
Log-Likelihood Function
For $n$ data points and $K$ classes, let:
- $\mathbf{x}_i$ be the feature vector for observation $i$,
- $\boldsymbol{\beta}_k$ be the coefficient vector for class $k$,
- $y_i$ be the true class label of observation $i$,
- $p_{ik}$ be the predicted probability for observation $i$ and class $k$.

The probability of $y_i$ given $\mathbf{x}_i$ is:

$$p(y_i = k \mid \mathbf{x}_i) = \frac{\exp(\boldsymbol{\beta}_k^\top \mathbf{x}_i)}{\sum_{j=1}^K \exp(\boldsymbol{\beta}_j^\top \mathbf{x}_i)}.$$

The log-likelihood is:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(y_i = k) \log p_{ik}.$$

Here, $\mathbf{1}(y_i = k)$ is an indicator function equal to 1 if $y_i = k$, and 0 otherwise.
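As a concrete illustration, here is a minimal NumPy sketch of the class probabilities and the log-likelihood. The array names (`X`, `y`, `beta`) and their shapes are assumptions made for this example, not anything fixed by the derivation above.

```python
import numpy as np

def softmax_probs(X, beta):
    """Class probabilities p_{ik} = exp(beta_k' x_i) / sum_j exp(beta_j' x_i).

    X    : (n, d) feature matrix, one row per observation
    beta : (d, K) coefficient matrix, one column per class
    """
    scores = X @ beta                            # (n, K) linear scores beta_k' x_i
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max to stabilize exp
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def log_likelihood(X, y, beta):
    """ell(beta) = sum_i log p_{i, y_i}, with integer labels y in {0, ..., K-1}."""
    P = softmax_probs(X, beta)
    n = X.shape[0]
    return np.log(P[np.arange(n), y]).sum()
```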
Gradient of the Log-Likelihood
Define $\pi_{ik} = p_{ik}$, the predicted probability. The gradient with respect to $\boldsymbol{\beta}_m$ is:

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}_m} = \sum_{i=1}^n \mathbf{x}_i \left( \mathbf{1}(y_i = m) - \pi_{im} \right).$$
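In code, this is a one-liner once you have the probabilities. The sketch below reuses the hypothetical `softmax_probs` helper from above; `Y` is the one-hot indicator matrix with entries $\mathbf{1}(y_i = k)$.

```python
def gradient(X, y, beta):
    """Stacked gradient: column m is sum_i x_i (1(y_i = m) - pi_{im}).

    Returns a (d, K) array laid out like beta.
    """
    n, K = X.shape[0], beta.shape[1]
    P = softmax_probs(X, beta)        # (n, K) probabilities pi_{ik}
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0          # one-hot indicators 1(y_i = k)
    return X.T @ (Y - P)              # (d, K)
```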
Hessian of the Log-Likelihood
The Hessian matrix $H$ consists of second derivatives:

$$H_{m,j} = \frac{\partial^2 \ell}{\partial \boldsymbol{\beta}_m \, \partial \boldsymbol{\beta}_j}.$$

For a single data point $i$:

$$H_{m,j}^{(i)} = -\,\mathbf{x}_i \mathbf{x}_i^\top \, \pi_{im} (\delta_{mj} - \pi_{ij}),$$

where $\delta_{mj}$ is the Kronecker delta (1 if $m = j$, 0 otherwise).

Summing over all data points:

$$H_{m,j} = \sum_{i=1}^n H_{m,j}^{(i)}.$$
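The Hessian can be assembled block by block from the same quantities. The sketch below computes the $(m, j)$ block directly from the formula and then stacks all $K \times K$ blocks into a $(dK) \times (dK)$ matrix; the function names and the explicit double loop are illustrative choices, not the only (or fastest) way to do it.

```python
def hessian_block(X, beta, m, j):
    """H_{m,j} = -sum_i x_i x_i^T pi_{im} (delta_{mj} - pi_{ij}), a (d, d) matrix."""
    P = softmax_probs(X, beta)
    delta_mj = 1.0 if m == j else 0.0
    w = P[:, m] * (delta_mj - P[:, j])   # per-observation weight pi_{im}(delta_{mj} - pi_{ij})
    return -(X * w[:, None]).T @ X       # -sum_i w_i x_i x_i^T

def full_hessian(X, beta):
    """Stack the K x K blocks into a (d*K, d*K) Hessian of the stacked coefficients."""
    d, K = beta.shape
    H = np.zeros((d * K, d * K))
    for m in range(K):
        for j in range(K):
            H[m*d:(m+1)*d, j*d:(j+1)*d] = hessian_block(X, beta, m, j)
    return H
```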
Summary
- Gradient:

  $$\frac{\partial \ell}{\partial \boldsymbol{\beta}_m} = \sum_{i=1}^n \mathbf{x}_i \left( \mathbf{1}(y_i = m) - \pi_{im} \right).$$

- Hessian:

  $$H_{m,j} = -\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \, \pi_{im} (\delta_{mj} - \pi_{ij}).$$
These expressions are used in optimization algorithms like Newton-Raphson to estimate the model parameters; a single Newton update over the stacked coefficients is sketched below.
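This sketch ties the pieces above together. Note that the small ridge term is an assumption added for this example: the full Hessian over all $K$ coefficient vectors can be singular (the softmax parameterization is not identifiable without fixing a reference class), so a tiny regularizer keeps the linear solve well defined.

```python
def newton_step(X, y, beta, ridge=1e-6):
    """One Newton-Raphson update on the stacked coefficients vec(beta).

    For maximizing ell, the update is beta_new = beta - H^{-1} g, where g is the
    gradient and H the (negative semidefinite) Hessian. The ridge term keeps H
    invertible.
    """
    d, K = beta.shape
    g = gradient(X, y, beta).reshape(-1, order="F")   # stack columns: (d*K,)
    H = full_hessian(X, beta) - ridge * np.eye(d * K)
    step = np.linalg.solve(H, g)
    return beta - step.reshape(d, K, order="F")
```

Iterating `newton_step` until the gradient norm is small gives the maximum-likelihood estimates (in practice one often pins one class's coefficients to zero instead of relying on the ridge term). If you need further assistance or detailed derivations, feel free to ask!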