Wrapping it up

If we minimize with respect to \( \boldsymbol{\beta} \), we then have

$$ \hat{\boldsymbol{\beta}} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\boldsymbol{\tilde{y}}, $$

where \( \boldsymbol{\tilde{y}} = \boldsymbol{y} - \overline{\boldsymbol{y}} \) and \( \tilde{X}_{ij} = X_{ij} - \frac{1}{n}\sum_{k=0}^{n-1}X_{kj} \), that is, we center both the response and each column of the design matrix by their respective means.
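As a quick numerical illustration, here is a minimal sketch in NumPy, using hypothetical synthetic data: we center \( \boldsymbol{y} \) and the columns of \( X \), solve the normal equations on the centered quantities, and recover the intercept afterwards as \( \hat{\beta}_0 = \overline{y} - \overline{\boldsymbol{x}}^T\hat{\boldsymbol{\beta}} \).

```python
import numpy as np

# Hypothetical data: n samples, p features, a known intercept of 2.0
rng = np.random.default_rng(2021)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=n)

# Center the response and each column of the design matrix
y_tilde = y - np.mean(y)
X_tilde = X - np.mean(X, axis=0)

# OLS solution on the centered data: (X~^T X~)^{-1} X~^T y~
beta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y_tilde)

# The intercept is recovered from the means of y and the columns of X
beta0_hat = np.mean(y) - np.mean(X, axis=0) @ beta_hat
```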

For Ridge regression we add \( \lambda \boldsymbol{\beta}^T\boldsymbol{\beta} \) to the cost function and then obtain

$$ \hat{\boldsymbol{\beta}} = (\tilde{X}^T\tilde{X} + \lambda I)^{-1}\tilde{X}^T\boldsymbol{\tilde{y}}. $$
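Note that since the data are centered, the intercept does not enter the penalty term; only \( \hat{\boldsymbol{\beta}} \) is shrunk towards zero, and the intercept is again recovered from the means. A minimal sketch, with a hypothetical value of \( \lambda \) and the same synthetic data as in the snippet above:

```python
import numpy as np

# Same hypothetical data as in the OLS sketch above
rng = np.random.default_rng(2021)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=n)

y_tilde = y - np.mean(y)
X_tilde = X - np.mean(X, axis=0)

# Ridge solution on the centered data: (X~^T X~ + lambda*I)^{-1} X~^T y~
lmbda = 0.1  # hypothetical value of the penalty parameter
beta_ridge = np.linalg.solve(X_tilde.T @ X_tilde + lmbda * np.eye(p),
                             X_tilde.T @ y_tilde)

# The intercept is not penalized; it is recovered from the means
beta0_ridge = np.mean(y) - np.mean(X, axis=0) @ beta_ridge
```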

What does this mean? And why do we insist on all this? Let us look at some examples.