Data Analysis and Machine Learning Lectures: Optimization and Gradient Methods

The equations to solve

Our compact equations use a vector $\hat{y}$ with $n$ elements $y_i$, an $n\times p$ matrix $\hat{X}$ which contains the $x_i$ values, and a vector $\hat{p}$ of fitted probabilities $p(y_i\vert x_i,\hat{\beta})$. With these definitions, the first derivative of the cost function takes the compact form $\frac{\partial \mathcal{C}(\hat{\beta})}{\partial \hat{\beta}} = -\hat{X}^T\left(\hat{y}-\hat{p}\right).$

If we in addition define a diagonal matrix $\hat{W}$ with elements $p(y_i\vert x_i,\hat{\beta})\left(1-p(y_i\vert x_i,\hat{\beta})\right)$, we obtain a compact expression for the second derivative, $\frac{\partial^2 \mathcal{C}(\hat{\beta})}{\partial \hat{\beta}\partial \hat{\beta}^T} = \hat{X}^T\hat{W}\hat{X}.$ This matrix of second derivatives is called the Hessian matrix.
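The two expressions above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation. It assumes the fitted probabilities come from the logistic (sigmoid) model $p(y_i=1\vert x_i,\hat{\beta}) = 1/(1+e^{-x_i^T\hat{\beta}})$, which the text here does not state explicitly. The function names `sigmoid`, `gradient` and `hessian`, as well as the synthetic data, are made up for the example.

```python
import numpy as np

def sigmoid(z):
    # Assumed model for the fitted probabilities p(y_i = 1 | x_i, beta).
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, beta):
    # First derivative of the cost function: -X^T (y - p).
    p = sigmoid(X @ beta)
    return -X.T @ (y - p)

def hessian(X, beta):
    # Second derivative: X^T W X, with W diagonal and W_ii = p_i (1 - p_i).
    # The diagonal is kept as a vector and broadcast over the rows of X,
    # which avoids building the full n x n matrix W.
    p = sigmoid(X @ beta)
    w = p * (1.0 - p)
    return X.T @ (w[:, None] * X)

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
n, n_features = 50, 3
X = rng.normal(size=(n, n_features))
beta_true = np.array([0.5, -1.0, 2.0])
y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

beta = np.zeros(n_features)   # starting guess for the parameters
g = gradient(X, y, beta)      # vector with p elements
H = hessian(X, beta)          # symmetric p x p matrix
```

At $\hat{\beta}=0$ every fitted probability equals $1/2$, so every diagonal element of $\hat{W}$ equals $1/4$ and the Hessian reduces to $\hat{X}^T\hat{X}/4$, which gives a quick sanity check of the code.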