A very important matrix we will meet again and again in machine learning is the Hessian. It is given by the second derivative of the cost function with respect to the parameters \( \boldsymbol{\beta} \). Using the above expression for derivatives of vectors and matrices, we find that the second derivative of the mean squared error cost function is
$$ \frac{\partial}{\partial \boldsymbol{\beta}}\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} =\frac{\partial}{\partial \boldsymbol{\beta}}\left[-\frac{2}{n}\boldsymbol{X}^T\left( \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)\right]=\frac{2}{n}\boldsymbol{X}^T\boldsymbol{X}. $$

The Hessian matrix plays an important role and is defined here as
$$ \boldsymbol{H}=\boldsymbol{X}^T\boldsymbol{X}. $$

For ordinary least squares, it is inversely proportional (the derivation comes next week) to the variance of the optimal parameters \( \hat{\boldsymbol{\beta}} \). Furthermore, we will see later this week that it is (apart from the factor \( 1/n \)) equal to the covariance matrix. It also plays a very important role in optimization algorithms and in Principal Component Analysis as a way to reduce the dimensionality of a machine learning/data analysis problem.
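As a small sanity check, the sketch below (with hypothetical, randomly generated data) compares the analytical Hessian \( (2/n)\boldsymbol{X}^T\boldsymbol{X} \) of the mean squared error with a finite-difference approximation, and also illustrates the covariance connection: with mean-centered columns, \( \boldsymbol{X}^T\boldsymbol{X}/n \) coincides with the empirical covariance matrix of the features.

```python
import numpy as np

# Hypothetical data, used only for illustration
rng = np.random.default_rng(2021)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def cost(beta):
    """Mean squared error C(beta) = (1/n) ||y - X beta||^2."""
    residual = y - X @ beta
    return residual @ residual / n

# Analytical Hessian from the derivation above
H_analytic = (2.0 / n) * X.T @ X

# Finite-difference Hessian (central differences) at an arbitrary point
beta0 = rng.normal(size=p)
eps = 1e-4
H_numeric = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        ei = np.zeros(p); ei[i] = eps
        ej = np.zeros(p); ej[j] = eps
        H_numeric[i, j] = (cost(beta0 + ei + ej) - cost(beta0 + ei - ej)
                           - cost(beta0 - ei + ej) + cost(beta0 - ei - ej)) / (4 * eps**2)

print(np.allclose(H_analytic, H_numeric, atol=1e-6))  # expected: True

# Covariance connection: with mean-centered columns, X^T X / n is the
# empirical (population) covariance matrix of the features.
Xc = X - X.mean(axis=0)
print(np.allclose(Xc.T @ Xc / n, np.cov(Xc, rowvar=False, bias=True)))  # expected: True
```

The finite-difference check works so well here because the cost function is quadratic in \( \boldsymbol{\beta} \), so the central-difference formula is essentially exact up to rounding errors.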
Linear algebra question: Can we use the Hessian matrix to say something about the properties of the cost function (our optimization problem)? (Hint: think about convex or concave problems and how to relate these to a matrix!)
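One way to start exploring the hint numerically (again with hypothetical random data) is to look at the eigenvalues of \( \boldsymbol{H}=\boldsymbol{X}^T\boldsymbol{X} \): a symmetric matrix with non-negative eigenvalues is positive semi-definite, which is the matrix property to connect with convexity of the cost function.

```python
import numpy as np

# Hypothetical design matrix
rng = np.random.default_rng(2021)
X = rng.normal(size=(100, 3))

H = X.T @ X
eigvals = np.linalg.eigvalsh(H)  # eigvalsh is appropriate for symmetric matrices

print(eigvals)                     # all eigenvalues should be non-negative
print(np.all(eigvals >= -1e-12))   # allow for tiny rounding errors; expected: True
```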