Gradient descent and Ridge

We have also discussed Ridge regression, where the loss function contains a regularization term given by the L_2 norm of \beta , C_{\text{ridge}}(\beta) = ||X\beta -\mathbf{y}||^2 + \lambda ||\beta||^2, \ \lambda \geq 0.

In order to minimize C_{\text{ridge}}(\beta) using GD, we only have to adjust the gradient as follows: \nabla_\beta C_{\text{ridge}}(\beta) = 2\begin{bmatrix} \sum_{i=1}^{100} \left(\beta_0+\beta_1x_i-y_i\right) \\ \sum_{i=1}^{100}\left( x_i (\beta_0+\beta_1x_i)-y_ix_i\right) \\ \end{bmatrix} + 2\lambda\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix} = 2 \left(X^T(X\beta - \mathbf{y})+\lambda \beta\right).
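
Before looping over several values of \lambda , it can be convenient to wrap this gradient in a small function. The sketch below is one possible implementation; the function name gd_ridge, the fixed learning rate eta and the number of iterations are illustrative choices, not prescribed by the exercise.

import numpy as np

def gd_ridge(X, y, lmbda, eta=1e-3, n_iter=10000):
    """Plain gradient descent for C_ridge(beta); illustrative sketch.

    eta must be small enough for the sum-form gradient above to converge;
    1e-3 works for ~100 data points with x in [0, 1].
    """
    beta = np.zeros(X.shape[1])                          #Start from beta = 0
    for _ in range(n_iter):
        grad = 2.0*(X.T @ (X @ beta - y) + lmbda*beta)   #Gradient derived above
        beta = beta - eta*grad                           #Fixed-step GD update
    return beta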

We can now extend our program to minimize C_{\text{ridge}}(\beta) using gradient descent and compare with the analytical solution given by \beta_{\text{ridge}} = \left(X^T X + \lambda I_{2 \times 2} \right)^{-1} X^T \mathbf{y}, for \lambda \in \{0,1,10,50,100\} ( \lambda = 0 corresponds to ordinary least squares). We can then compute ||\beta_{\text{ridge}}|| for each \lambda .

import numpy as np

"""
The following setup is just a suggestion, feel free to write it the way you like.
"""

#Setup problem described in the exercise
N  = 100 #Nr of datapoints
M  = 2   #Nr of features
x  = np.random.rand(N)
y  = 5*x**2 + 0.1*np.random.randn(N)


#Compute analytical beta for Ridge regression,
#beta_ridge = (X^T X + lambda*I)^{-1} X^T y
X    = np.c_[np.ones(N),x]        #Design matrix with intercept column
XT_X = np.dot(X.T,X)

l  = 0.1                          #Ridge parameter lambda
Id = np.eye(XT_X.shape[0])        #Identity matrix of matching size

Z = np.linalg.inv(XT_X+l*Id)      #(X^T X + lambda*I)^{-1}
beta_ridge = np.dot(Z,np.dot(X.T,y))

print(beta_ridge)
print(np.linalg.norm(beta_ridge)) #||beta||
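
To compare gradient descent with the analytical solution for \lambda = 0, 1, 10, 50, 100 as described above, one possible continuation is the loop below. It reuses the gd_ridge sketch defined earlier (with its illustrative learning rate and iteration count) together with X, y, XT_X and Id from the setup above.

#Possible continuation: compare GD with the analytical solution for each lambda
for lmbda in [0, 1, 10, 50, 100]:
    beta_analytic = np.linalg.inv(XT_X + lmbda*Id) @ (X.T @ y)
    beta_gd       = gd_ridge(X, y, lmbda)   #GD sketch from earlier
    print(f"lambda = {lmbda:5.1f}  ||beta_analytic|| = {np.linalg.norm(beta_analytic):.4f}"
          f"  ||beta_gd|| = {np.linalg.norm(beta_gd):.4f}")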