Gradient descent and Ridge

We have also discussed Ridge regression, where the loss function contains a regularization term given by the \( L_2 \) norm of \( \beta \), $$ C_{\text{ridge}}(\beta) = ||X\beta -\mathbf{y}||^2 + \lambda ||\beta||^2, \ \lambda \geq 0. $$

In order to minimize \( C_{\text{ridge}}(\beta) \) using GD, we only have to adjust the gradient as follows: $$ \nabla_\beta C_{\text{ridge}}(\beta) = 2\begin{bmatrix} \sum_{i=1}^{100} \left(\beta_0+\beta_1x_i-y_i\right) \\ \sum_{i=1}^{100}\left( x_i (\beta_0+\beta_1x_i)-y_ix_i\right) \\ \end{bmatrix} + 2\lambda\begin{bmatrix} \beta_0 \\ \beta_1\end{bmatrix} = 2 (X^T(X\beta - \mathbf{y})+\lambda \beta). $$
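In code, this amounts to adding \( 2\lambda\beta \) to the ordinary least squares gradient. A minimal sketch of the adjusted gradient in NumPy (assuming a design matrix X, response y, and penalty lam are already defined; the function name grad_ridge is just an illustrative choice) could look like:

import numpy as np

# Ridge gradient: 2*(X^T (X beta - y) + lambda*beta)
def grad_ridge(beta, X, y, lam):
    return 2.0*(X.T @ (X @ beta - y) + lam*beta)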

We can now extend our program to minimize \( C_{\text{ridge}}(\beta) \) using gradient descent and compare with the analytical solution given by $$ \beta_{\text{ridge}} = \left(X^T X + \lambda I_{2 \times 2} \right)^{-1} X^T \mathbf{y}, $$ for \( \lambda \in \{0,1,10,50,100\} \) (\( \lambda = 0 \) corresponds to ordinary least squares). We can then compute \( ||\beta_{\text{ridge}}|| \) for each \( \lambda \).

import numpy as np

"""
The following setup is just a suggestion, feel free to write it the way you like.
"""

# Set up the problem described in the exercise
N  = 100 # Number of data points
M  = 2   # Number of features
x  = np.random.rand(N)
y  = 5*x**2 + 0.1*np.random.randn(N) # Quadratic data with Gaussian noise


# Compute the analytical beta for Ridge regression
X    = np.c_[np.ones(N), x] # Design matrix with columns [1, x]
XT_X = np.dot(X.T, X)

l  = 0.1 # Ridge parameter lambda (a single example value)
Id = np.eye(XT_X.shape[0])

Z = np.linalg.inv(XT_X + l*Id)
beta_ridge = np.dot(Z, np.dot(X.T, y))

print(beta_ridge)
print(np.linalg.norm(beta_ridge)) # ||beta_ridge||
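A possible continuation is sketched below; it reuses the variables defined above and loops over the requested values of \( \lambda \), running gradient descent with the Ridge gradient and comparing against the analytical solution. The learning rate eta and the number of iterations n_iter are assumed values, not prescribed by the exercise.

# Compare gradient descent with the analytical Ridge solution for several lambdas.
# eta and n_iter are assumed values; adjust them if GD has not converged.
eta    = 0.001  # learning rate (assumed value)
n_iter = 10000  # number of GD iterations (assumed value)

for lam in [0, 1, 10, 50, 100]:
    # Analytical Ridge solution for this lambda (lambda = 0 gives OLS)
    beta_analytic = np.linalg.inv(XT_X + lam*Id) @ X.T @ y

    # Gradient descent with the Ridge gradient 2*(X^T(X beta - y) + lambda*beta)
    beta_gd = np.zeros(M)
    for _ in range(n_iter):
        gradient = 2.0*(X.T @ (X @ beta_gd - y) + lam*beta_gd)
        beta_gd  = beta_gd - eta*gradient

    print(f"lambda = {lam}")
    print("  analytic:", beta_analytic, " ||beta|| =", np.linalg.norm(beta_analytic))
    print("  GD:      ", beta_gd,       " ||beta|| =", np.linalg.norm(beta_gd))

As \( \lambda \) increases, \( ||\beta_{\text{ridge}}|| \) should shrink, since the penalty pulls the coefficients toward zero.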