The equations to solve

Our compact equations used a definition of a vector \( \boldsymbol{y} \) with \( n \) elements \( y_i \), an \( n\times p \) matrix \( \boldsymbol{X} \) which contains the \( x_i \) values and a vector \( \boldsymbol{p} \) of fitted probabilities \( p(y_i\vert x_i,\boldsymbol{\beta}) \). We rewrote in a more compact form the first derivative of the cost function as

$$ \frac{\partial \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{p}\right). $$

If we in addition define a diagonal matrix \( \boldsymbol{W} \) with elements \( p(y_i\vert x_i,\boldsymbol{\beta})(1-p(y_i\vert x_i,\boldsymbol{\beta}) \), we can obtain a compact expression of the second derivative as

$$ \frac{\partial^2 \mathcal{C}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^T} = \boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}. $$

This defines what is called the Hessian matrix.