Week 35: From Ordinary Linear Regression to Ridge and Lasso Regression

Interpretations and optimizing our parameters

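Requiring that the derivative of the cost function $C(\boldsymbol{\beta})$ with respect to the parameters vanishes gives (see the derivations below)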

$\frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T} = 0 = \boldsymbol{X}^T\left( \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right),$

$\boldsymbol{X}^T\boldsymbol{y} = \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta},$

and if the matrix $\boldsymbol{X}^T\boldsymbol{X}$ is invertible we have the solution

$\boldsymbol{\beta} =\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}.$
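To make the normal equations concrete, here is a minimal NumPy sketch. The data are synthetic and purely illustrative: the sizes `n` and `p`, the random design matrix, the vector `beta_true` and the noise level are all assumptions made for this example and are not the nuclear binding energy data. The sketch forms $\boldsymbol{X}^T\boldsymbol{X}$ and $\boldsymbol{X}^T\boldsymbol{y}$, inverts the small $p\times p$ matrix, and checks that $\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)$ vanishes at the solution.

```python
import numpy as np

# Synthetic data (assumption for illustration only)
rng = np.random.default_rng(2024)
n, p = 100, 5
X = rng.normal(size=(n, p))                       # design matrix, n x p
beta_true = np.array([1.0, -2.0, 0.5, 3.0, -1.5])  # hypothetical parameters
y = X @ beta_true + 0.1 * rng.normal(size=n)      # noisy targets

# Normal equations: X^T y = X^T X beta
XtX = X.T @ X          # p x p matrix
Xty = X.T @ y          # vector of length p

# Explicit inverse, mirroring the formula in the text
beta_hat = np.linalg.inv(XtX) @ Xty

# At the optimum, X^T (y - X beta) should be zero up to round-off
gradient_term = X.T @ (y - X @ beta_hat)

print("Shape of X^T X:", XtX.shape)
print("Estimated beta:", beta_hat)
print("Largest |X^T (y - X beta)|:", np.max(np.abs(gradient_term)))
```

The explicit inverse is used here only because it follows the formula line by line. In practice one would normally prefer a linear solver or a least-squares routine, since these avoid forming the inverse.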

We note also that since our design matrix is defined as $\boldsymbol{X}\in {\mathbb{R}}^{n\times p}$, the product $\boldsymbol{X}^T\boldsymbol{X} \in {\mathbb{R}}^{p\times p}$. In most cases we have $p \ll n$. In our example case below we have $p=5$, meaning that we end up inverting a small $5\times 5$ matrix. This is a rather common situation: in many cases we end up with low-dimensional matrices to invert. The methods discussed here, and many other supervised learning algorithms such as classification with logistic regression or support vector machines, exhibit dimensionalities that allow for the use of direct linear algebra methods such as LU decomposition or Singular Value Decomposition (SVD) for finding the inverse of the matrix $\boldsymbol{X}^T\boldsymbol{X}$. This is discussed on Thursday this week.
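As a rough illustration of the two direct approaches mentioned above, the following sketch solves the same normal equations in two ways. The data are again synthetic (the sizes, the seed and the random matrices are assumptions for this example only). The first route uses `np.linalg.solve`, which relies on an LU factorization with partial pivoting from LAPACK. The second route uses the SVD of $\boldsymbol{X}$ itself, $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T$, for which $\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T = \boldsymbol{V}\boldsymbol{\Sigma}^{-1}\boldsymbol{U}^T$ when all singular values are nonzero.

```python
import numpy as np

# Synthetic data (assumption for illustration only)
rng = np.random.default_rng(35)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

A = X.T @ X   # p x p matrix of the normal equations
b = X.T @ y

# Route 1: LU-based solve of A beta = b (no explicit inverse formed)
beta_lu = np.linalg.solve(A, b)

# Route 2: thin SVD of X, then beta = V Sigma^{-1} U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

# The 2-norm condition number of X^T X is the square of that of X
cond_A = np.linalg.cond(A)

print("Max difference between the two solutions:",
      np.max(np.abs(beta_lu - beta_svd)))
print("Condition number of X^T X:", cond_A)
print("Squared condition number of X:", (s[0] / s[-1]) ** 2)
```

The condition number printed at the end is one way to judge how safe the inversion is: for a well-conditioned matrix such as this random one the two routes agree to round-off, while a very large value signals that $\boldsymbol{X}^T\boldsymbol{X}$ is close to singular.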

Small question: Do you think the example we have at hand here (the nuclear binding energies) can lead to problems in inverting the matrix $\boldsymbol{X}^T\boldsymbol{X}$ ? What kind of problems can we expect?