Fixing the singularity

If our design matrix \( \boldsymbol{X} \), which enters the linear regression problem

$$ \begin{align} \boldsymbol{\beta} & = (\boldsymbol{X}^{T} \boldsymbol{X})^{-1} \boldsymbol{X}^{T} \boldsymbol{y}, \tag{1} \end{align} $$

has linearly dependent column vectors, then \( \boldsymbol{X}^T\boldsymbol{X} \) is singular: we cannot compute its inverse and therefore cannot find the parameters (estimators) \( \beta_i \). The estimators are only well-defined if \( (\boldsymbol{X}^{T}\boldsymbol{X})^{-1} \) exists. Linear dependence becomes more likely when the matrix \( \boldsymbol{X} \) is high-dimensional, that is, when it has many columns (features) relative to the number of rows (data points). In that case we are likely to encounter a situation where the regression parameters \( \beta_i \) cannot be estimated.
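To make the problem concrete, the following minimal sketch builds a small design matrix with a deliberately duplicated (rescaled) column and inspects the rank of \( \boldsymbol{X}^T\boldsymbol{X} \). The data, the seed and the variable names are illustrative assumptions and are not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
n = 10                          # number of data points (illustrative)
x = rng.normal(size=n)

# Hypothetical design matrix: intercept, x, and a third column equal to 2*x.
# The third column is linearly dependent on the second by construction.
X = np.column_stack([np.ones(n), x, 2.0 * x])

XtX = X.T @ X

# A rank below the number of columns signals that X^T X is singular.
rank_X = np.linalg.matrix_rank(X)
rank_XtX = np.linalg.matrix_rank(XtX)
print("number of columns:", X.shape[1])
print("rank of X:        ", rank_X)
print("rank of X^T X:    ", rank_XtX)
```

Since the rank is smaller than the number of columns, the inverse in Eq. (1) does not exist for this matrix, and a numerical inversion would either fail or return values dominated by round-off errors.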

A cheap ad hoc fix is to add a small diagonal component to the matrix we want to invert, that is, we change

$$ \boldsymbol{X}^{T} \boldsymbol{X} \rightarrow \boldsymbol{X}^{T} \boldsymbol{X}+\lambda \boldsymbol{I}, $$

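where \( \boldsymbol{I} \) is the identity matrix and \( \lambda > 0 \) is a small number. Since \( \boldsymbol{X}^{T}\boldsymbol{X} \) is positive semidefinite, every eigenvalue of \( \boldsymbol{X}^{T}\boldsymbol{X}+\lambda\boldsymbol{I} \) is at least \( \lambda \), so the shifted matrix is always invertible. When we discuss Ridge regression, this is actually the expression we end up evaluating. The parameter \( \lambda \) is called a hyperparameter; more about this later.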
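The effect of the diagonal shift can be sketched numerically. The example below is a sketch under stated assumptions: the synthetic data, the value of `lam` and all variable names are hypothetical choices for illustration. It uses a design matrix with linearly dependent columns, adds \( \lambda \boldsymbol{I} \), and solves the resulting linear system with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
n = 10
x = rng.normal(size=n)

# Hypothetical design matrix with linearly dependent columns (third = 2 * second).
X = np.column_stack([np.ones(n), x, 2.0 * x])

# Hypothetical targets: a straight line plus a little noise.
y = 1.0 + 3.0 * x + 0.1 * rng.normal(size=n)

lam = 1e-3                      # illustrative value of the hyperparameter lambda
p = X.shape[1]
A = X.T @ X + lam * np.eye(p)   # X^T X + lambda * I

# All eigenvalues of the shifted (symmetric) matrix should now be positive.
eigvals = np.linalg.eigvalsh(A)
print("smallest eigenvalue of X^T X + lambda I:", eigvals.min())

# Solve (X^T X + lambda I) beta = X^T y for the parameters.
beta = np.linalg.solve(A, X.T @ y)
y_fit = X @ beta
print("beta:", beta)
print("largest absolute residual:", np.max(np.abs(y_fit - y)))
```

Solving the linear system is a design choice made here for numerical stability; it evaluates the same expression as Eq. (1) with the shifted matrix. The individual values of the two dependent parameters depend on \( \lambda \), while the fitted values `y_fit` remain close to the data.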