Deriving the Ridge Regression Equations

Using the matrix-vector expression for Ridge regression and dropping the factor \( 1/n \) in front of the standard mean squared error term, we have

$$ C(\boldsymbol{X},\boldsymbol{\beta})=\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})\right\}+\lambda\boldsymbol{\beta}^T\boldsymbol{\beta}, $$

and taking the derivative with respect to \( \boldsymbol{\beta} \) gives

$$ \frac{\partial C(\boldsymbol{X},\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -2\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})+2\lambda\boldsymbol{\beta}. $$

Setting this derivative to zero leads to a slightly modified matrix inversion problem which, for finite values of \( \lambda \), does not suffer from singularity problems. We obtain the optimal parameters

$$ \hat{\boldsymbol{\beta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X}+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}, $$

with \( \boldsymbol{I} \) being a \( p\times p \) identity matrix. This solution corresponds to minimizing the OLS cost function subject to the constraint

$$ \sum_{i=0}^{p-1} \beta_i^2 \leq t, $$

with \( t \) a finite positive number.
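
As a minimal numerical sketch of this closed-form expression (the synthetic data and the value of \( \lambda \) below are illustrative assumptions, not taken from the text):

```python
import numpy as np

# Closed-form Ridge solution: beta = (X^T X + lambda I)^{-1} X^T y
# The synthetic data and the value of lambda are illustrative choices.
rng = np.random.default_rng(42)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

lmbda = 0.1
beta_ridge = np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T @ y)
```

Solving the linear system with np.linalg.solve is preferable to forming the matrix inverse explicitly.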

If we keep the \( 1/n \) factor, the equation for the optimal \( \boldsymbol{\beta} \) changes to

$$ \hat{\boldsymbol{\beta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X}+n\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

In many textbooks the \( 1/n \) factor is omitted. Note that a library like Scikit-Learn does not include the \( 1/n \) factor in its setup of the cost function.
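
To see the effect of this convention in practice, the sketch below (again with hypothetical data) checks that the optimal parameters obtained from a cost function that keeps the \( 1/n \) factor with penalty \( \lambda \) are reproduced by Scikit-Learn's Ridge only when its alpha parameter is set to \( n\lambda \):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data (an assumption, not from the text)
rng = np.random.default_rng(42)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lmbda = 0.1

# Optimal parameters when the cost function keeps the 1/n factor:
# beta = (X^T X + n lambda I)^{-1} X^T y
beta_with_n = np.linalg.solve(X.T @ X + n * lmbda * np.eye(p), X.T @ y)

# Scikit-Learn omits the 1/n factor, so alpha must equal n*lambda to match
clf = Ridge(alpha=n * lmbda, fit_intercept=False).fit(X, y)
print(np.allclose(beta_with_n, clf.coef_))  # should print True
```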

When we compare this with the ordinary least squares result we have

$$ \hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}, $$

where the matrix \( \boldsymbol{X}^T\boldsymbol{X} \) can be singular. However, using the SVD we can always compute the pseudoinverse of \( \boldsymbol{X}^T\boldsymbol{X} \), and thereby a well-defined OLS solution.
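
For comparison, here is a small sketch of the OLS case (with hypothetical data, deliberately constructed with linearly dependent columns so that \( \boldsymbol{X}^T\boldsymbol{X} \) is singular). NumPy's pinv is SVD-based and returns the pseudoinverse, so a (minimum-norm) OLS solution remains well defined:

```python
import numpy as np

# Hypothetical design matrix with a duplicated column, so X^T X is singular
rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 4))
X = np.column_stack([X, X[:, 0]])   # rank-deficient design matrix
y = rng.normal(size=n)

# np.linalg.inv(X.T @ X) would fail or be numerically meaningless here,
# but the SVD-based pseudoinverse gives the minimum-norm OLS solution
beta_ols = np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalently, apply the pseudoinverse of X directly
beta_ols_alt = np.linalg.pinv(X) @ y
print(np.allclose(beta_ols, beta_ols_alt))  # should print True
```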

We see that Ridge regression is nothing but the standard OLS with a modified diagonal term added to \( \boldsymbol{X}^T\boldsymbol{X} \). The consequences, in particular for our discussion of the bias-variance tradeoff, are rather interesting. We will see that for specific values of \( \lambda \) we may even reduce the variance of the optimal parameters \( \boldsymbol{\beta} \). These topics and other related ones will be discussed after the more linear-algebra-oriented analysis here.

Using our insights about the SVD of the design matrix \( \boldsymbol{X} \), we have already analyzed the OLS solutions in terms of the columns (the left singular vectors) of the orthogonal matrix \( \boldsymbol{U} \) as

$$ \tilde{\boldsymbol{y}}_{\mathrm{OLS}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} =\boldsymbol{U}\boldsymbol{U}^T\boldsymbol{y}. $$

For Ridge regression this becomes

$$ \tilde{\boldsymbol{y}}_{\mathrm{Ridge}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}_{\mathrm{Ridge}} = \boldsymbol{U\Sigma V^T}\left(\boldsymbol{V}\boldsymbol{\Sigma}^2\boldsymbol{V}^T+\lambda\boldsymbol{I} \right)^{-1}(\boldsymbol{U\Sigma V^T})^T\boldsymbol{y}=\boldsymbol{U}\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}^2+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{\Sigma}\boldsymbol{U}^T\boldsymbol{y}=\sum_{j=0}^{p-1}\boldsymbol{u}_j\boldsymbol{u}_j^T\frac{\sigma_j^2}{\sigma_j^2+\lambda}\boldsymbol{y}, $$

with the vectors \( \boldsymbol{u}_j \) being the columns of \( \boldsymbol{U} \) from the SVD of the matrix \( \boldsymbol{X} \). In the second step we used the orthogonality of \( \boldsymbol{V} \), which gives \( \left(\boldsymbol{V}\boldsymbol{\Sigma}^2\boldsymbol{V}^T+\lambda\boldsymbol{I}\right)^{-1}=\boldsymbol{V}\left(\boldsymbol{\Sigma}^2+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{V}^T \).
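
A small numerical check of this SVD expression (the synthetic data and the value of \( \lambda \) are again illustrative assumptions):

```python
import numpy as np

# Check that the SVD expression reproduces the Ridge prediction
rng = np.random.default_rng(42)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lmbda = 0.1

# Prediction from the closed-form Ridge parameters
beta_ridge = np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T @ y)
ytilde_ridge = X @ beta_ridge

# Prediction from the thin SVD, X = U Sigma V^T:
# ytilde = sum_j u_j u_j^T y * sigma_j^2 / (sigma_j^2 + lambda)
U, sigma, VT = np.linalg.svd(X, full_matrices=False)
ytilde_svd = U @ (sigma**2 / (sigma**2 + lmbda) * (U.T @ y))

print(np.allclose(ytilde_ridge, ytilde_svd))  # should print True
```

Comparing with the OLS expression \( \boldsymbol{U}\boldsymbol{U}^T\boldsymbol{y} \), each component of \( \boldsymbol{y} \) along \( \boldsymbol{u}_j \) is shrunk by the factor \( \sigma_j^2/(\sigma_j^2+\lambda) \) instead of being kept with weight one.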