Mathematical Interpretation of Ordinary Least Squares

This section presents a mathematical analysis of various regression algorithms (ordinary least squares, Ridge regression and Lasso regression). The analysis builds on an important algorithm in linear algebra, the so-called Singular Value Decomposition (SVD).

We have shown that in ordinary least squares the optimal parameters \( \boldsymbol{\beta} \) are given by

$$ \hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

The hat over \( \boldsymbol{\beta} \) indicates that these are the optimal parameters obtained by minimizing the cost function.
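
As a minimal numerical sketch (the synthetic data, variable names, and degree-two design matrix below are illustrative choices, not part of the derivation), we can compute \( \hat{\boldsymbol{\beta}} \) with numpy. For numerical stability we solve the normal equations with np.linalg.solve rather than forming the explicit inverse:

```python
import numpy as np

# Hypothetical data: n = 100 samples from a degree-two polynomial plus noise
np.random.seed(2024)
n = 100
x = np.random.rand(n)
y = 2.0 + 3.0 * x + 0.5 * x**2 + 0.1 * np.random.randn(n)

# Design matrix with columns [1, x, x^2]
X = np.column_stack([np.ones(n), x, x**2])

# Solve the normal equations (X^T X) beta = X^T y; np.linalg.solve is
# numerically preferable to forming the inverse (X^T X)^{-1} explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```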

This means that our best model is defined as

$$ \tilde{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

We now define a matrix

$$ \boldsymbol{A}=\boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T. $$

We can then rewrite the model as

$$ \tilde{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{A}\boldsymbol{y}. $$
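
Continuing the sketch above, we can form \( \boldsymbol{A} \) explicitly and check that \( \boldsymbol{X}\hat{\boldsymbol{\beta}} \) and \( \boldsymbol{A}\boldsymbol{y} \) agree:

```python
# Form A = X (X^T X)^{-1} X^T explicitly (fine for a small example;
# in practice one avoids materializing this n x n matrix)
A = X @ np.linalg.inv(X.T @ X) @ X.T

# The fitted values X beta_hat coincide with A y
y_tilde = X @ beta_hat
print(np.allclose(y_tilde, A @ y))  # True
```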

The matrix \( \boldsymbol{A} \) has the important property that \( \boldsymbol{A}^2=\boldsymbol{A} \), that is, it is idempotent. This is the defining property of a projection matrix. Indeed, \( \boldsymbol{A}^2 = \boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T = \boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T = \boldsymbol{A} \). Since \( \boldsymbol{A} \) is also symmetric, \( \boldsymbol{A}^T=\boldsymbol{A} \), it is an orthogonal projection: our optimal model \( \tilde{\boldsymbol{y}} \) is the orthogonal projection of \( \boldsymbol{y} \) onto the space spanned by the column vectors of \( \boldsymbol{X} \). A projection matrix which is idempotent but not symmetric defines instead an oblique (non-orthogonal) projection.
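
We can verify these properties numerically and, anticipating the SVD analysis, check that \( \boldsymbol{A} \) equals \( \boldsymbol{U}\boldsymbol{U}^T \), where \( \boldsymbol{X}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T \) is the thin SVD of \( \boldsymbol{X} \) (a sketch under the assumption that \( \boldsymbol{X} \) has full column rank):

```python
# Idempotency and symmetry: the two hallmarks of an orthogonal projection
print(np.allclose(A @ A, A))  # A^2 = A
print(np.allclose(A.T, A))    # A^T = A

# Connection to the SVD: with the thin SVD X = U S V^T and X of full
# column rank, A = X (X^T X)^{-1} X^T reduces to U U^T
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(A, U @ U.T))
```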