The equations to solve

The compact form of our equations uses a vector \( \hat{y} \) with \( n \) elements \( y_i \), an \( n\times p \) matrix \( \hat{X} \) containing the \( x_i \) values, and a vector \( \hat{p} \) of fitted probabilities \( p(y_i\vert x_i,\hat{\beta}) \). With these definitions, the first derivative of the cost function can be written as $$ \frac{\partial \mathcal{C}(\hat{\beta})}{\partial \hat{\beta}} = -\hat{X}^T\left(\hat{y}-\hat{p}\right). $$
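The gradient expression maps directly onto a few lines of NumPy. The sketch below is only an illustration: it assumes the fitted probabilities come from the logistic (sigmoid) function applied to \( \hat{X}\hat{\beta} \), and the names `sigmoid` and `cost_gradient`, as well as the small data set, are hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(t):
    """Logistic function, assumed here to give the fitted probabilities."""
    return 1.0 / (1.0 + np.exp(-t))

def cost_gradient(X, y, beta):
    """Gradient of the cost function: -X^T (y - p)."""
    p = sigmoid(X @ beta)      # vector of fitted probabilities, length n
    return -X.T @ (y - p)      # vector of length p (number of parameters)

# Made-up toy data: n = 4 samples, an intercept column plus one feature
X = np.array([[1.0,  0.5],
              [1.0, -1.2],
              [1.0,  2.0],
              [1.0,  0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.zeros(2)             # with beta = 0 all probabilities equal 0.5

grad = cost_gradient(X, y, beta)
print(grad)
```

The result has one component per parameter in \( \hat{\beta} \), so its shape matches that of `beta`.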

If we, in addition, define a diagonal matrix \( \hat{W} \) with elements \( p(y_i\vert x_i,\hat{\beta})\left(1-p(y_i\vert x_i,\hat{\beta})\right) \), we obtain a compact expression for the second derivative, $$ \frac{\partial^2 \mathcal{C}(\hat{\beta})}{\partial \hat{\beta}\partial \hat{\beta}^T} = \hat{X}^T\hat{W}\hat{X}. $$ This defines what is called the Hessian matrix.
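The Hessian can be sketched in the same way. The example below again assumes sigmoid-based probabilities and uses hypothetical names (`sigmoid`, `cost_hessian`) and made-up data. It avoids building the full \( n\times n \) matrix \( \hat{W} \) by scaling the rows of \( \hat{X} \) with the diagonal elements, and then compares the result with the explicit product \( \hat{X}^T\hat{W}\hat{X} \).

```python
import numpy as np

def sigmoid(t):
    """Logistic function, assumed here to give the fitted probabilities."""
    return 1.0 / (1.0 + np.exp(-t))

def cost_hessian(X, beta):
    """Hessian of the cost function: X^T W X with W = diag(p_i (1 - p_i))."""
    p = sigmoid(X @ beta)
    w = p * (1.0 - p)                  # diagonal elements of W
    return X.T @ (w[:, None] * X)      # scale row i of X by w_i, then X^T (...)

# Made-up toy data: n = 4 samples, an intercept column plus one feature
X = np.array([[1.0,  0.5],
              [1.0, -1.2],
              [1.0,  2.0],
              [1.0,  0.1]])
beta = np.zeros(2)

H = cost_hessian(X, beta)

# Cross-check against the explicit diagonal-matrix form
p = sigmoid(X @ beta)
W = np.diag(p * (1.0 - p))
H_explicit = X.T @ W @ X

print(H)
print(np.allclose(H, H_explicit))
```

The matrix is \( p\times p \) and symmetric. Since every diagonal element of \( \hat{W} \) lies between 0 and 1/4, none of them is negative.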