Let us now define a vector \( \boldsymbol{y} \) with \( n \) elements \( y_i \), an \( n\times p \) matrix \( \boldsymbol{X} \) which contains the \( x_i \) values, and a vector \( \boldsymbol{p} \) of fitted probabilities \( p(y_i\vert x_i,\boldsymbol{\theta}) \). The first derivative of the cost function can then be written in the more compact form
$$ \frac{\partial \mathcal{C}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{p}\right). $$
If we in addition define a diagonal matrix \( \boldsymbol{W} \) with diagonal elements \( p(y_i\vert x_i,\boldsymbol{\theta})\left(1-p(y_i\vert x_i,\boldsymbol{\theta})\right) \), we obtain a compact expression for the second derivative as
$$ \frac{\partial^2 \mathcal{C}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^T} = \boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}. $$
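
The two matrix expressions above translate almost line by line into NumPy. The following is a minimal sketch, not a definitive implementation. It assumes that the fitted probabilities come from the logistic (sigmoid) function applied to \( \boldsymbol{X}\boldsymbol{\theta} \) and that the cost is the cross-entropy. The function names, the synthetic data, and the variable `n_features` (standing in for the number of columns \( p \)) are illustrative choices only.

```python
import numpy as np

def sigmoid(z):
    # Assumed model: p_i = 1 / (1 + exp(-x_i^T theta))
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Assumed cost: cross-entropy, -sum[y log p + (1 - y) log(1 - p)]
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    # First derivative: -X^T (y - p)
    p = sigmoid(X @ theta)
    return -X.T @ (y - p)

def hessian(theta, X):
    # Second derivative: X^T W X with W = diag(p_i (1 - p_i)).
    # Scaling the rows of X by the weights avoids building the n x n matrix W.
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    return X.T @ (w[:, None] * X)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n, n_features = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, n_features - 1))])
y = (rng.uniform(size=n) < 0.5).astype(float)
theta = rng.normal(size=n_features)

g = gradient(theta, X, y)   # vector of length n_features
H = hessian(theta, X)       # matrix of shape (n_features, n_features)
```

Since every diagonal element of \( \boldsymbol{W} \) lies between 0 and 1/4, the matrix `H` is symmetric and positive semi-definite, which can be checked numerically with `np.linalg.eigvalsh(H)`.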