Meet the Hessian Matrix

A very important matrix that we will meet again and again in machine learning is the Hessian. It is given by the second derivative of the cost function with respect to the parameters \( \boldsymbol{\theta} \). Using the above expressions for derivatives with respect to vectors and matrices, we find that the second derivative of the mean squared error cost function is,

$$ \frac{\partial}{\partial \boldsymbol{\theta}}\frac{\partial C(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^T} =\frac{\partial}{\partial \boldsymbol{\theta}}\left[-\frac{2}{n}\boldsymbol{X}^T\left( \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right)\right]=\frac{2}{n}\boldsymbol{X}^T\boldsymbol{X}. $$

The Hessian matrix plays an important role and is defined here as

$$ \boldsymbol{H}=\boldsymbol{X}^T\boldsymbol{X}. $$
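
To make the expressions above concrete, here is a minimal NumPy sketch (the data, the simple polynomial design matrix and all variable names are invented for illustration) that builds the analytical Hessian \( \frac{2}{n}\boldsymbol{X}^T\boldsymbol{X} \) and checks it against a finite-difference estimate of the second derivative of the mean squared error cost.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Toy data and a simple polynomial design matrix (illustrative choices only)
n = 100
x = np.linspace(0, 1, n)
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x, x**2])   # columns: 1, x, x^2

def mse_cost(theta):
    """Mean squared error cost C(theta) = (1/n) ||y - X theta||^2."""
    residual = y - X @ theta
    return residual @ residual / n

# Analytical Hessian of the MSE cost: (2/n) X^T X
H_analytic = (2.0 / n) * X.T @ X

# Finite-difference estimate of the same second derivative, for comparison
p = X.shape[1]
eps = 1e-4
theta0 = np.zeros(p)
H_numeric = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        ei = eps * np.eye(p)[i]
        ej = eps * np.eye(p)[j]
        H_numeric[i, j] = (mse_cost(theta0 + ei + ej) - mse_cost(theta0 + ei)
                           - mse_cost(theta0 + ej) + mse_cost(theta0)) / eps**2

print(np.allclose(H_analytic, H_numeric, atol=1e-4))   # expect True
```

Note that the Hessian does not depend on \( \boldsymbol{\theta} \): for the mean squared error with a linear model it is the same constant matrix everywhere in parameter space.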

For ordinary least squares it is, as we will derive next week, inversely proportional to the variance of the optimal parameters \( \hat{\boldsymbol{\theta}} \). Furthermore, we will see next week that it is, apart from the factor \( 1/n \), equal to the covariance matrix. The Hessian also plays a very important role in optimization algorithms and in Principal Component Analysis as a way to reduce the dimensionality of a machine learning/data analysis problem. We will discuss this in greater detail next week when we introduce gradient methods.
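
To illustrate the covariance remark, here is a minimal sketch (random data invented for the example) showing that \( \frac{1}{n}\boldsymbol{X}^T\boldsymbol{X} \) coincides with NumPy's biased sample covariance matrix of the columns of \( \boldsymbol{X} \), provided the columns have been centered first.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data matrix with three features (numbers are made up)
n = 500
X = rng.normal(size=(n, 3))

# Center each column so that (1/n) X^T X becomes the sample covariance matrix
Xc = X - X.mean(axis=0)

cov_from_hessian = Xc.T @ Xc / n                  # (1/n) X^T X with centered columns
cov_numpy = np.cov(Xc, rowvar=False, bias=True)   # NumPy's biased sample covariance

print(np.allclose(cov_from_hessian, cov_numpy))   # expect True
```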

Linear algebra question: Can we use the Hessian matrix to say something about the properties of the cost function (our optimization problem)? Hint: think about convex or concave problems and how these properties relate to a matrix!
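
Without spoiling the answer, one numerical way to explore the hint is to inspect the eigenvalues of \( \boldsymbol{X}^T\boldsymbol{X} \) for a few randomly generated design matrices, as in the sketch below (all data invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(7)

# Eigenvalues of X^T X for a few random design matrices.
# What sign do they always have, and what does that suggest about the cost function?
for trial in range(3):
    X = rng.normal(size=(50, 4))
    H = X.T @ X
    eigvals = np.linalg.eigvalsh(H)   # eigvalsh: eigenvalues of a symmetric matrix
    print(f"trial {trial}: smallest eigenvalue = {eigvals.min():.3e}")
```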