{"id":180596,"date":"2025-01-07T17:29:52","date_gmt":"2025-01-07T17:29:52","guid":{"rendered":"https:\/\/learnexams.com\/blog\/?p=180596"},"modified":"2025-01-07T17:29:54","modified_gmt":"2025-01-07T17:29:54","slug":"gradient-and-hessian-of-log-likelihood-for-multinomial-logistic-regression","status":"publish","type":"post","link":"https:\/\/www.learnexams.com\/blog\/2025\/01\/07\/gradient-and-hessian-of-log-likelihood-for-multinomial-logistic-regression\/","title":{"rendered":"Gradient and Hessian of log-likelihood for multinomial logistic regression"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/learnexams.com\/blog\/wp-content\/uploads\/2025\/01\/image-54.png\" alt=\"\" class=\"wp-image-180597\"\/><\/figure>\n\n\n\n<p>The gradient and Hessian of the log-likelihood function for multinomial logistic regression are essential for optimization in parameter estimation. 
The expressions are stated below, starting from the log-likelihood.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Log-Likelihood Function<\/h3>\n\n\n\n<p>For $n$ data points and $K$ classes, let:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>$\\mathbf{x}_i$ be the feature vector for observation $i$,<\/li>\n\n\n\n<li>$\\mathbf{\\beta}_k$ be the coefficient vector for class $k$,<\/li>\n\n\n\n<li>$y_i$ be the true class label of observation $i$,<\/li>\n\n\n\n<li>$p_{ik}$ be the predicted probability that observation $i$ belongs to class $k$.<\/li>\n<\/ul>\n\n\n\n<p>Under the softmax model, the probability that $y_i = k$ given $\\mathbf{x}_i$ is: $$p_{ik} = p(y_i = k \\mid \\mathbf{x}_i) = \\frac{\\exp(\\mathbf{\\beta}_k^\\top \\mathbf{x}_i)}{\\sum_{j=1}^K \\exp(\\mathbf{\\beta}_j^\\top \\mathbf{x}_i)}.$$<\/p>\n\n\n\n<p>The log-likelihood is: $$\\ell(\\mathbf{\\beta}) = \\sum_{i=1}^n \\sum_{k=1}^K \\mathbf{1}(y_i = k) \\log p_{ik}.$$<\/p>\n\n\n\n<p>Here, $\\mathbf{\\beta}$ collects the class vectors $\\mathbf{\\beta}_1, \\dots, \\mathbf{\\beta}_K$, and $\\mathbf{1}(y_i = k)$ is an indicator function equal to 1 if $y_i = k$, and 0 otherwise.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Gradient of the Log-Likelihood<\/h3>\n\n\n\n<p>Write $\\pi_{ik} = p_{ik}$ for the predicted probability. 
The gradient with respect to $\\mathbf{\\beta}_m$ is: $$\\frac{\\partial \\ell}{\\partial \\mathbf{\\beta}_m} = \\sum_{i=1}^n \\mathbf{x}_i \\left( \\mathbf{1}(y_i = m) - \\pi_{im} \\right).$$<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Hessian of the Log-Likelihood<\/h3>\n\n\n\n<p>The Hessian matrix $H$ consists of second derivatives, arranged in blocks indexed by class pairs $(m, j)$: $$H_{m,j} = \\frac{\\partial^2 \\ell}{\\partial \\mathbf{\\beta}_m \\, \\partial \\mathbf{\\beta}_j^\\top}.$$<\/p>\n\n\n\n<p>Differentiating the gradient uses the softmax derivative $\\partial \\pi_{im} \/ \\partial \\mathbf{\\beta}_j = \\pi_{im} (\\delta_{mj} - \\pi_{ij}) \\mathbf{x}_i$, so for a single data point $i$: $$H_{m,j}^{(i)} = - \\mathbf{x}_i \\mathbf{x}_i^\\top \\pi_{im} (\\delta_{mj} - \\pi_{ij}),$$<\/p>\n\n\n\n<p>where $\\delta_{mj}$ is the Kronecker delta (1 if $m = j$, 0 otherwise).<\/p>\n\n\n\n<p>Summing over all data points: $$H_{m,j} = \\sum_{i=1}^n H_{m,j}^{(i)}.$$<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Summary<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Gradient<\/strong>:<\/li>\n<\/ul>\n\n\n\n<p>$$\\frac{\\partial \\ell}{\\partial \\mathbf{\\beta}_m} = \\sum_{i=1}^n \\mathbf{x}_i \\left( \\mathbf{1}(y_i = m) - \\pi_{im} \\right).$$<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Hessian<\/strong>:<\/li>\n<\/ul>\n\n\n\n<p>$$H_{m,j} = - \\sum_{i=1}^n \\mathbf{x}_i \\mathbf{x}_i^\\top \\pi_{im} (\\delta_{mj} - \\pi_{ij}).$$<\/p>\n\n\n\n<p>Optimization algorithms such as Newton-Raphson use these expressions to estimate the model parameters. 
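As a brief sketch of how the two expressions combine, not a full derivation: assume the class vectors are stacked into one parameter vector $\\mathbf{\\theta} = (\\mathbf{\\beta}_1^\\top, \\dots, \\mathbf{\\beta}_K^\\top)^\\top$, with $\\mathbf{g}$ the correspondingly stacked gradient and $H$ the block matrix with blocks $H_{m,j}$ (the symbols $\\mathbf{\\theta}$, $\\mathbf{g}$ and the iteration index $t$ are notation introduced here for illustration). One Newton-Raphson step then takes the form: $$\\mathbf{\\theta}^{(t+1)} = \\mathbf{\\theta}^{(t)} - \\left[ H(\\mathbf{\\theta}^{(t)}) \\right]^{-1} \\mathbf{g}(\\mathbf{\\theta}^{(t)}).$$ Note that with all $K$ classes parameterized, adding the same vector to every $\\mathbf{\\beta}_k$ leaves each $\\pi_{ik}$ unchanged, so $H$ is singular; a common workaround is to fix one class as a reference (for example $\\mathbf{\\beta}_K = \\mathbf{0}$) or to add a penalty term before inverting. 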
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The gradient and Hessian of the log-likelihood function for multinomial logistic regression are essential for optimization in parameter estimation. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[25],"tags":[],"class_list":["post-180596","post","type-post","status-publish","format-standard","hentry","category-exams-certification"],"_links":{"self":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/posts\/180596","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/comments?post=180596"}],"version-history":[{"count":0,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/posts\/180596\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/media?parent=180596"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/categories?post=180596"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.learnexams.com\/blog\/wp-json\/wp\/v2\/tags?post=180596"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}