Our compact equations were based on a vector \hat{y} with n elements y_i, an n\times p matrix \hat{X} containing the x_i values, and a vector \hat{p} of fitted probabilities p(y_i\vert x_i,\hat{\beta}). With these definitions, the first derivative of the cost function takes the compact form \frac{\partial \mathcal{C}(\hat{\beta})}{\partial \hat{\beta}} = -\hat{X}^T\left(\hat{y}-\hat{p}\right).
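As a minimal sketch of the gradient expression above, the following NumPy code evaluates -\hat{X}^T(\hat{y}-\hat{p}) on a small toy data set. It assumes that the fitted probabilities are given by the logistic (sigmoid) function of \hat{X}\hat{\beta}. The function names `sigmoid` and `cost_gradient` and the data are hypothetical and chosen for illustration only.

```python
import numpy as np


def sigmoid(t):
    """Logistic function, assumed here to model p(y_i = 1 | x_i, beta)."""
    return 1.0 / (1.0 + np.exp(-t))


def cost_gradient(X, y, beta):
    """Gradient of the cost function, -X^T (y - p)."""
    p = sigmoid(X @ beta)      # vector of fitted probabilities
    return -X.T @ (y - p)


# Toy data (illustrative only): an intercept column plus one feature
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=5)])
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
beta = np.zeros(2)             # with beta = 0 every fitted probability is 0.5

grad = cost_gradient(X, y, beta)
print(grad)
```

With \hat{\beta}=0 all fitted probabilities equal 1/2, so the first component of the gradient reduces to -\sum_i (y_i - 1/2), which gives a simple check of the implementation.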
If we additionally define an n\times n diagonal matrix \hat{W} with diagonal elements p(y_i\vert x_i,\hat{\beta})(1-p(y_i\vert x_i,\hat{\beta})), we obtain a compact expression for the second derivative, \frac{\partial^2 \mathcal{C}(\hat{\beta})}{\partial \hat{\beta}\partial \hat{\beta}^T} = \hat{X}^T\hat{W}\hat{X}. This matrix of second derivatives is called the Hessian matrix.
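The Hessian expression \hat{X}^T\hat{W}\hat{X} can be sketched in NumPy in the same way. As before, we assume that the fitted probabilities follow the logistic (sigmoid) function of \hat{X}\hat{\beta}. The names `sigmoid` and `cost_hessian` and the small design matrix are hypothetical and serve only as an illustration.

```python
import numpy as np


def sigmoid(t):
    """Logistic function, assumed here to model p(y_i = 1 | x_i, beta)."""
    return 1.0 / (1.0 + np.exp(-t))


def cost_hessian(X, beta):
    """Hessian of the cost function, X^T W X, with W = diag(p_i (1 - p_i))."""
    p = sigmoid(X @ beta)          # vector of fitted probabilities
    W = np.diag(p * (1.0 - p))     # n x n diagonal weight matrix
    return X.T @ W @ X


# Toy design matrix (illustrative only): an intercept column plus one feature
X = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  2.0]])
beta = np.zeros(2)                 # with beta = 0 every weight equals 0.25

H = cost_hessian(X, beta)
print(H)
print(np.linalg.eigvalsh(H))       # eigenvalues of the symmetric matrix H
```

Since the diagonal elements of \hat{W} lie between 0 and 1/4, the Hessian is symmetric and positive semidefinite, so the printed eigenvalues should be non-negative. For larger n one would normally avoid forming the full diagonal matrix and instead scale the rows of \hat{X} by the weights directly.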