Gradient descent and Ridge

We have also discussed Ridge regression, where the loss function contains a regularizer given by the squared \( L_2 \) norm of \( \beta \), $$ C_{\text{ridge}}(\beta) = ||X\beta -\mathbf{y}||^2 + \lambda ||\beta||^2, \ \lambda \geq 0. $$

In order to minimize \( C_{\text{ridge}}(\beta) \) using GD we only have to adjust the gradient as follows $$ \nabla_\beta C_{\text{ridge}}(\beta) = 2\begin{bmatrix} \sum_{i=1}^{100} \left(\beta_0+\beta_1x_i-y_i\right) \\ \sum_{i=1}^{100}\left( x_i (\beta_0+\beta_1x_i)-y_ix_i\right) \\ \end{bmatrix} + 2\lambda\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix} = 2 (X^T(X\beta - \mathbf{y})+\lambda \beta). $$
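One way to convince yourself that the gradient expression above is right is to compare it with a central finite-difference approximation of \( C_{\text{ridge}} \). The following is only a minimal sketch: the function names `cost_ridge` and `grad_ridge`, the seed, the test point `beta` and the value `lam = 1.0` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Illustrative data in the same spirit as the exercise (seed is an assumption)
rng = np.random.default_rng(0)
N = 100
x = rng.random(N)
y = 5 * x**2 + 0.1 * rng.standard_normal(N)
X = np.c_[np.ones(N), x]   # design matrix with intercept column
lam = 1.0                  # illustrative ridge parameter

def cost_ridge(beta):
    """C_ridge(beta) = ||X beta - y||^2 + lam * ||beta||^2."""
    r = X @ beta - y
    return r @ r + lam * (beta @ beta)

def grad_ridge(beta):
    """Analytic gradient 2 (X^T (X beta - y) + lam * beta)."""
    return 2.0 * (X.T @ (X @ beta - y) + lam * beta)

# Central finite differences along each coordinate direction
beta = rng.standard_normal(2)   # arbitrary test point
eps = 1e-6
grad_fd = np.array([
    (cost_ridge(beta + eps * e) - cost_ridge(beta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print("analytic gradient:         ", grad_ridge(beta))
print("finite-difference gradient:", grad_fd)
```

Since the cost is quadratic in \( \beta \), the two gradients should agree up to rounding errors.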

We can now extend our program to minimize \( C_{\text{ridge}}(\beta) \) using gradient descent and compare with the analytical solution given by $$ \beta_{\text{ridge}} = \left(X^T X + \lambda I_{2 \times 2} \right)^{-1} X^T \mathbf{y}, $$ for \( \lambda \in \{0,1,10,50,100\} \) (\( \lambda = 0 \) corresponds to ordinary least squares). We can then compute \( ||\beta_{\text{ridge}}|| \) for each \( \lambda \).

import numpy as np

The following setup is just a suggestion; feel free to write it the way you like.

#Setup problem described in the exercise
N  = 100 #Nr of datapoints
M  = 2   #Nr of features
x  = np.random.rand(N)
y  = 5*x**2 + 0.1*np.random.randn(N)

#Compute analytic beta for Ridge regression 
X    = np.c_[np.ones(N),x]
XT_X = np.matmul(X.T, X)

l  = 0.1 #Ridge parameter lambda
Id = np.eye(XT_X.shape[0])

Z = np.linalg.inv(XT_X+l*Id)
beta_ridge = np.matmul(Z, np.matmul(X.T, y))

print(np.linalg.norm(beta_ridge)) #||beta||
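
As a possible way to carry out the comparison described above, the sketch below runs plain gradient descent with the Ridge gradient \( 2(X^T(X\beta-\mathbf{y})+\lambda\beta) \) and checks it against the analytical solution for each \( \lambda \). Several choices here are assumptions rather than part of the exercise: the helper names `ridge_gd` and `ridge_analytic`, the seed, the number of iterations, and the fixed step size, which is set to the inverse of the largest eigenvalue of the Hessian \( 2(X^TX+\lambda I) \). The analytical solution is computed with `np.linalg.solve` instead of an explicit inverse.

```python
import numpy as np

# Same type of data as in the setup above (seed is an assumption)
rng = np.random.default_rng(2024)
N = 100
x = rng.random(N)
y = 5 * x**2 + 0.1 * rng.standard_normal(N)
X = np.c_[np.ones(N), x]

def ridge_analytic(X, y, lam):
    """Solve (X^T X + lam I) beta = X^T y."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def ridge_gd(X, y, lam, n_iter=2000):
    """Plain gradient descent on C_ridge with a fixed step size."""
    # Hessian of C_ridge; 1 / (largest eigenvalue) is a safe step size
    H = 2.0 * (X.T @ X + lam * np.eye(X.shape[1]))
    eta = 1.0 / np.linalg.eigvalsh(H).max()
    beta = np.zeros(X.shape[1])   # start from beta = 0
    for _ in range(n_iter):
        grad = 2.0 * (X.T @ (X @ beta - y) + lam * beta)
        beta -= eta * grad
    return beta

lambdas = [0, 1, 10, 50, 100]
norms_gd = []
for lam in lambdas:
    b_gd = ridge_gd(X, y, lam)
    b_an = ridge_analytic(X, y, lam)
    norms_gd.append(np.linalg.norm(b_gd))
    print(f"lambda = {lam:>3}: ||beta_gd|| = {np.linalg.norm(b_gd):.4f}, "
          f"||beta_analytic|| = {np.linalg.norm(b_an):.4f}")
```

With this step size the iteration is guaranteed not to diverge, at the price of slow progress along the direction of the smallest Hessian eigenvalue; a larger learning rate or a stopping criterion based on the gradient norm are natural things to experiment with.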