Week 36: Linear Regression and Gradient descent

Morten Hjorth-Jensen Department of Physics, University of Oslo, Norway

September 1-5, 2025

Plans for week 36

Material for the lecture on Monday September 1:

Linear Regression, ordinary least squares (OLS), Ridge and Lasso and mathematical analysis
Derivation of Gradient descent and discussion of implementations for
Video of lecture at https://youtu.be/nVE_FRnGAHw
Whiteboard notes at https://github.com/CompPhysics/MachineLearning/blob/master/doc/HandWrittenNotes/2025/FYSSTKweek36.pdf

Material for the lab sessions on Tuesday and Wednesday (see at the end of these slides):

Technicalities concerning Ridge and Lasso linear regression.
Presentation and discussion of the first project

Reading suggestion:

Goodfellow et al, Deep Learning, introduction to gradient descent, see chapter 4.3 at https://www.deeplearningbook.org/contents/numerical.html
Rashcka et al, pages 37-44 and pages 278-283 with focus on linear regression.
Video on gradient descent at https://www.youtube.com/watch?v=sDv4f4s2SB8

Material for lecture Monday September 2

Mathematical Interpretation of Ordinary Least Squares

What is presented here is a mathematical analysis of various regression algorithms (ordinary least squares, Ridge and Lasso Regression). The analysis is based on an important algorithm in linear algebra, the so-called Singular Value Decomposition (SVD).

We have shown that in ordinary least squares the optimal parameters $ \theta $ are given by

$$ \hat{\boldsymbol{\theta}} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

The hat over $ \boldsymbol{\theta} $ means we have the optimal parameters after minimization of the cost function.

This means that our best model is defined as

$$ \tilde{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\theta}} = \boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

We now define a matrix

$$ \boldsymbol{A}=\boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T. $$

We can rewrite

$$ \tilde{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\theta}} = \boldsymbol{A}\boldsymbol{y}. $$

The matrix $ \boldsymbol{A} $ has the important property that $ \boldsymbol{A}^2=\boldsymbol{A} $. This is the definition of a projection matrix. We can then interpret our optimal model $ \tilde{\boldsymbol{y}} $ as being represented by an orthogonal projection of $ \boldsymbol{y} $ onto a space defined by the column vectors of $ \boldsymbol{X} $. In our case here the matrix $ \boldsymbol{A} $ is a square matrix. If it is a general rectangular matrix we have an oblique projection matrix.

Residual Error

We have defined the residual error as

$$ \boldsymbol{\epsilon}=\boldsymbol{y}-\tilde{\boldsymbol{y}}=\left[\boldsymbol{I}-\boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\right]\boldsymbol{y}. $$

The residual errors are then the projections of $ \boldsymbol{y} $ onto the orthogonal component of the space defined by the column vectors of $ \boldsymbol{X} $.

Simple case

If the matrix $ \boldsymbol{X} $ is an orthogonal (or unitary in case of complex values) matrix, we have

$$ \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{X}\boldsymbol{X}^T = \boldsymbol{I}. $$

In this case the matrix $ \boldsymbol{A} $ becomes

$$ \boldsymbol{A}=\boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T)=\boldsymbol{I}, $$

and we have the obvious case

$$ \boldsymbol{\epsilon}=\boldsymbol{y}-\tilde{\boldsymbol{y}}=0. $$

This serves also as a useful test of our codes.

The singular value decomposition

The examples we have looked at so far are cases where we normally can invert the matrix $ \boldsymbol{X}^T\boldsymbol{X} $. Using a polynomial expansion where we fit of various functions leads to row vectors of the design matrix which are essentially orthogonal due to the polynomial character of our model. Obtaining the inverse of the design matrix is then often done via a so-called LU, QR or Cholesky decomposition.

As we will also see in the first project, this may however not the be case in general and a standard matrix inversion algorithm based on say LU, QR or Cholesky decomposition may lead to singularities. We will see examples of this below and in other examples.

There is however a way to circumvent this problem and also gain some insights about the ordinary least squares approach, and later shrinkage methods like Ridge and Lasso regressions.

This is given by the Singular Value Decomposition (SVD) algorithm, perhaps the most powerful linear algebra algorithm. The SVD provides a numerically stable matrix decomposition that is used in a large swath oc applications and the decomposition is always stable numerically.

In machine learning it plays a central role in dealing with for example design matrices that may be near singular or singular. Furthermore, as we will see here, the singular values can be related to the covariance matrix (and thereby the correlation matrix) and in turn the variance of a given quantity. It plays also an important role in the principal component analysis where high-dimensional data can be reduced to the statistically relevant features.

Linear Regression Problems

One of the typical problems we encounter with linear regression, in particular when the matrix $ \boldsymbol{X} $ (our so-called design matrix) is high-dimensional, are problems with near singular or singular matrices. The column vectors of $ \boldsymbol{X} $ may be linearly dependent, normally referred to as super-collinearity. This means that the matrix may be rank deficient and it is basically impossible to to model the data using linear regression. As an example, consider the matrix

$$ \begin{align*} \mathbf{X} & = \left[ \begin{array}{rrr} 1 & -1 & 2 \\ 1 & 0 & 1 \\ 1 & 2 & -1 \\ 1 & 1 & 0 \end{array} \right] \end{align*} $$

The columns of $ \boldsymbol{X} $ are linearly dependent. We see this easily since the the first column is the row-wise sum of the other two columns. The rank (more correct, the column rank) of a matrix is the dimension of the space spanned by the column vectors. Hence, the rank of $ \mathbf{X} $ is equal to the number of linearly independent columns. In this particular case the matrix has rank 2.

Super-collinearity of an $ (n \times p) $-dimensional design matrix $ \mathbf{X} $ implies that the inverse of the matrix $ \boldsymbol{X}^T\boldsymbol{X} $ (the matrix we need to invert to solve the linear regression equations) is non-invertible. If we have a square matrix that does not have an inverse, we say this matrix singular. The example here demonstrates this

$$ \begin{align*} \boldsymbol{X} & = \left[ \begin{array}{rr} 1 & -1 \\ 1 & -1 \end{array} \right]. \end{align*} $$

We see easily that $ \mbox{det}(\boldsymbol{X}) = x_{11} x_{22} - x_{12} x_{21} = 1 \times (-1) - 1 \times (-1) = 0 $. Hence, $ \mathbf{X} $ is singular and its inverse is undefined. This is equivalent to saying that the matrix $ \boldsymbol{X} $ has at least an eigenvalue which is zero.

Fixing the singularity

If our design matrix $ \boldsymbol{X} $ which enters the linear regression problem

$$ \begin{align} \boldsymbol{\theta} & = (\boldsymbol{X}^{T} \boldsymbol{X})^{-1} \boldsymbol{X}^{T} \boldsymbol{y}, \label{_auto1} \end{align} $$

has linearly dependent column vectors, we will not be able to compute the inverse of $ \boldsymbol{X}^T\boldsymbol{X} $ and we cannot find the parameters (estimators) $ \theta_i $. The estimators are only well-defined if $ (\boldsymbol{X}^{T}\boldsymbol{X})^{-1} $ exists. This is more likely to happen when the matrix $ \boldsymbol{X} $ is high-dimensional. In this case it is likely to encounter a situation where the regression parameters $ \theta_i $ cannot be estimated.

A cheap ad hoc approach is simply to add a small diagonal component to the matrix to invert, that is we change

$$ \boldsymbol{X}^{T} \boldsymbol{X} \rightarrow \boldsymbol{X}^{T} \boldsymbol{X}+\lambda \boldsymbol{I}, $$

where $ \boldsymbol{I} $ is the identity matrix. When we discuss Ridge regression this is actually what we end up evaluating. The parameter $ \lambda $ is called a hyperparameter. More about this later.

Ridge and LASSO Regression

Let us remind ourselves about the expression for the standard Mean Squared Error (MSE) which we used to define our cost function and the equations for the ordinary least squares (OLS) method, that is our optimization problem is

$$ {\displaystyle \min_{\boldsymbol{\theta}\in {\mathbb{R}}^{p}}}\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right)\right\}. $$

or we can state it as

$$ {\displaystyle \min_{\boldsymbol{\theta}\in {\mathbb{R}}^{p}}}\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2=\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2, $$

where we have used the definition of a norm-2 vector, that is

$$ \vert\vert \boldsymbol{x}\vert\vert_2 = \sqrt{\sum_i x_i^2}. $$

By minimizing the above equation with respect to the parameters $ \boldsymbol{\theta} $ we could then obtain an analytical expression for the parameters $ \boldsymbol{\theta} $. We can add a regularization parameter $ \lambda $ by defining a new cost function to be optimized, that is

$$ {\displaystyle \min_{\boldsymbol{\theta}\in {\mathbb{R}}^{p}}}\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\lambda\vert\vert \boldsymbol{\theta}\vert\vert_2^2 $$

which leads to the Ridge regression minimization problem where we require that $ \vert\vert \boldsymbol{\theta}\vert\vert_2^2\le t $, where $ t $ is a finite number larger than zero. By defining

$$ C(\boldsymbol{X},\boldsymbol{\theta})=\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\lambda\vert\vert \boldsymbol{\theta}\vert\vert_1, $$

we have a new optimization equation

which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator.

Here we have defined the norm-1 as

$$ \vert\vert \boldsymbol{x}\vert\vert_1 = \sum_i \vert x_i\vert. $$

Deriving the Ridge Regression Equations

Using the matrix-vector expression for Ridge regression and dropping the parameter $ 1/n $ in front of the standard means squared error equation, we have

$$ C(\boldsymbol{X},\boldsymbol{\theta})=\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})\right\}+\lambda\boldsymbol{\theta}^T\boldsymbol{\theta}, $$

and taking the derivatives with respect to $ \boldsymbol{\theta} $ we obtain then a slightly modified matrix inversion problem which for finite values of $ \lambda $ does not suffer from singularity problems. We obtain the optimal parameters

$$ \hat{\boldsymbol{\theta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X}+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}, $$

with $ \boldsymbol{I} $ being a $ p\times p $ identity matrix with the constraint that

$$ \sum_{i=0}^{p-1} \theta_i^2 \leq t, $$

with $ t $ a finite positive number.

If we keep the $ 1/n $ factor, the equation for the optimal $ \theta $ changes to

$$ \hat{\boldsymbol{\theta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X}+n\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

In many textbooks the $ 1/n $ term is often omitted. Note that a library like Scikit-Learn does not include the $ 1/n $ factor in the setup of the cost function.

When we compare this with the ordinary least squares result we have

$$ \hat{\boldsymbol{\theta}}_{\mathrm{OLS}} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}, $$

which can lead to singular matrices. However, with the SVD, we can always compute the inverse of the matrix $ \boldsymbol{X}^T\boldsymbol{X} $.

We see that Ridge regression is nothing but the standard OLS with a modified diagonal term added to $ \boldsymbol{X}^T\boldsymbol{X} $. The consequences, in particular for our discussion of the bias-variance tradeoff are rather interesting. We will see that for specific values of $ \lambda $, we may even reduce the variance of the optimal parameters $ \boldsymbol{\theta} $. These topics and other related ones, will be discussed after the more linear algebra oriented analysis here.

When we have discussed the singular value decomposition of the design matrix $ \boldsymbol{X} $, we will in turn perform a more rigorous mathematical discussion of Ridge regression.

The code here is a simple demonstration of how to implement Ridge regression with our own code and compare this with scikit-learn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n


# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(3155)

n = 100
x = np.random.rand(n)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)

Maxpolydegree = 20
X = np.zeros((n,Maxpolydegree))
#We include explicitely the intercept column
for degree in range(Maxpolydegree):
    X[:,degree] = x**degree
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

p = Maxpolydegree
I = np.eye(p,p)
# Decide which values of lambda to use
nlambdas = 6
MSEOwnRidgePredict = np.zeros(nlambdas)
MSERidgePredict = np.zeros(nlambdas)
lambdas = np.logspace(-4, 2, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    OwnRidgeTheta = np.linalg.pinv(X_train.T @ X_train+lmb*I) @ X_train.T @ y_train
    # Note: we include the intercept column and no scaling
    RegRidge = linear_model.Ridge(lmb,fit_intercept=False)
    RegRidge.fit(X_train,y_train)
    # and then make the prediction
    ytildeOwnRidge = X_train @ OwnRidgeTheta
    ypredictOwnRidge = X_test @ OwnRidgeTheta
    ytildeRidge = RegRidge.predict(X_train)
    ypredictRidge = RegRidge.predict(X_test)
    MSEOwnRidgePredict[i] = MSE(y_test,ypredictOwnRidge)
    MSERidgePredict[i] = MSE(y_test,ypredictRidge)
    print("Theta values for own Ridge implementation")
    print(OwnRidgeTheta)
    print("Theta values for Scikit-Learn Ridge implementation")
    print(RegRidge.coef_)
    print("MSE values for own Ridge implementation")
    print(MSEOwnRidgePredict[i])
    print("MSE values for Scikit-Learn Ridge implementation")
    print(MSERidgePredict[i])

# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSEOwnRidgePredict, 'r', label = 'MSE own Ridge Test')
plt.plot(np.log10(lambdas), MSERidgePredict, 'g', label = 'MSE Ridge Test')

plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()

The results here agree when we force Scikit-Learn's Ridge function to include the first column in our design matrix. We see that the results agree very well. Here we have thus explicitely included the intercept column in the design matrix. What happens if we do not include the intercept in our fit? We will discuss this in more detail next week.

Basic math of the SVD

From standard linear algebra we know that a square matrix $ \boldsymbol{X} $ can be diagonalized if and only if it is a so-called normal matrix, that is if $ \boldsymbol{X}\in {\mathbb{R}}^{n\times n} $ we have $ \boldsymbol{X}\boldsymbol{X}^T=\boldsymbol{X}^T\boldsymbol{X} $ or if $ \boldsymbol{X}\in {\mathbb{C}}^{n\times n} $ we have $ \boldsymbol{X}\boldsymbol{X}^{\dagger}=\boldsymbol{X}^{\dagger}\boldsymbol{X} $. The matrix has then a set of eigenpairs

$$ (\lambda_1,\boldsymbol{u}_1),\dots, (\lambda_n,\boldsymbol{u}_n), $$

and the eigenvalues are given by the diagonal matrix

$$ \boldsymbol{\Sigma}=\mathrm{Diag}(\lambda_1, \dots,\lambda_n). $$

The matrix $ \boldsymbol{X} $ can be written in terms of an orthogonal/unitary transformation $ \boldsymbol{U} $

$$ \boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T, $$

with $ \boldsymbol{U}\boldsymbol{U}^T=\boldsymbol{I} $ or $ \boldsymbol{U}\boldsymbol{U}^{\dagger}=\boldsymbol{I} $.

Not all square matrices are diagonalizable. A matrix like the one discussed above

$$ \boldsymbol{X} = \begin{bmatrix} 1& -1 \\ 1& -1\\ \end{bmatrix} $$

is not diagonalizable, it is a so-called defective matrix. It is easy to see that the condition $ \boldsymbol{X}\boldsymbol{X}^T=\boldsymbol{X}^T\boldsymbol{X} $ is not fulfilled.

The SVD, a Fantastic Algorithm

However, and this is the strength of the SVD algorithm, any general matrix $ \boldsymbol{X} $ can be decomposed in terms of a diagonal matrix and two orthogonal/unitary matrices. The Singular Value Decompostion (SVD) theorem states that a general $ m\times n $ matrix $ \boldsymbol{X} $ can be written in terms of a diagonal matrix $ \boldsymbol{\Sigma} $ of dimensionality $ m\times n $ and two orthognal matrices $ \boldsymbol{U} $ and $ \boldsymbol{V} $, where the first has dimensionality $ m \times m $ and the last dimensionality $ n\times n $. We have then

$$ \boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T $$

As an example, the above defective matrix can be decomposed as

$$ \boldsymbol{X} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1& 1 \\ 1& -1\\ \end{bmatrix} \begin{bmatrix} 2& 0 \\ 0& 0\\ \end{bmatrix} \frac{1}{\sqrt{2}}\begin{bmatrix} 1& -1 \\ 1& 1\\ \end{bmatrix}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T, $$

with eigenvalues $ \sigma_1=2 $ and $ \sigma_2=0 $. The SVD exits always!

The SVD decomposition (singular values) gives eigenvalues $ \sigma_i\geq\sigma_{i+1} $ for all $ i $ and for dimensions larger than $ i=p $, the eigenvalues (singular values) are zero.

In the general case, where our design matrix $ \boldsymbol{X} $ has dimension $ n\times p $, the matrix is thus decomposed into an $ n\times n $ orthogonal matrix $ \boldsymbol{U} $, a $ p\times p $ orthogonal matrix $ \boldsymbol{V} $ and a diagonal matrix $ \boldsymbol{\Sigma} $ with $ r=\mathrm{min}(n,p) $ singular values $ \sigma_i\geq 0 $ on the main diagonal and zeros filling the rest of the matrix. There are at most $ p $ singular values assuming that $ n > p $. In our regression examples for the nuclear masses and the equation of state this is indeed the case, while for the Ising model we have $ p > n $. These are often cases that lead to near singular or singular matrices.

The columns of $ \boldsymbol{U} $ are called the left singular vectors while the columns of $ \boldsymbol{V} $ are the right singular vectors.

Economy-size SVD

If we assume that $ n > p $, then our matrix $ \boldsymbol{U} $ has dimension $ n \times n $. The last $ n-p $ columns of $ \boldsymbol{U} $ become however irrelevant in our calculations since they are multiplied with the zeros in $ \boldsymbol{\Sigma} $.

The economy-size decomposition removes extra rows or columns of zeros from the diagonal matrix of singular values, $ \boldsymbol{\Sigma} $, along with the columns in either $ \boldsymbol{U} $ or $ \boldsymbol{V} $ that multiply those zeros in the expression. Removing these zeros and columns can improve execution time and reduce storage requirements without compromising the accuracy of the decomposition.

If $ n > p $, we keep only the first $ p $ columns of $ \boldsymbol{U} $ and $ \boldsymbol{\Sigma} $ has dimension $ p\times p $. If $ p > n $, then only the first $ n $ columns of $ \boldsymbol{V} $ are computed and $ \boldsymbol{\Sigma} $ has dimension $ n\times n $. The $ n=p $ case is obvious, we retain the full SVD. In general the economy-size SVD leads to less FLOPS and still conserving the desired accuracy.

Codes for the SVD

import numpy as np
# SVD inversion
def SVD(A):
    ''' Takes as input a numpy matrix A and returns inv(A) based on singular value decomposition (SVD).
    SVD is numerically more stable than the inversion algorithms provided by
    numpy and scipy.linalg at the cost of being slower.
    '''
    U, S, VT = np.linalg.svd(A,full_matrices=True)
    print('test U')
    print( (np.transpose(U) @ U - U @np.transpose(U)))
    print('test VT')
    print( (np.transpose(VT) @ VT - VT @np.transpose(VT)))
    print(U)
    print(S)
    print(VT)

    D = np.zeros((len(U),len(VT)))
    for i in range(0,len(VT)):
        D[i,i]=S[i]
    return U @ D @ VT


X = np.array([ [1.0,-1.0], [1.0,-1.0]])
#X = np.array([[1, 2], [3, 4], [5, 6]])

print(X)
C = SVD(X)
# Print the difference between the original matrix and the SVD one
print(C-X)

The matrix $ \boldsymbol{X} $ has columns that are linearly dependent. The first column is the row-wise sum of the other two columns. The rank of a matrix (the column rank) is the dimension of space spanned by the column vectors. The rank of the matrix is the number of linearly independent columns, in this case just $ 2 $. We see this from the singular values when running the above code. Running the standard inversion algorithm for matrix inversion with $ \boldsymbol{X}^T\boldsymbol{X} $ results in the program terminating due to a singular matrix.

Note about SVD Calculations

The $ U $, $ S $, and $ V $ matrices returned from the svd() function cannot be multiplied directly.

As you can see from the code, the $ S $ vector must be converted into a diagonal matrix. This may cause a problem as the size of the matrices do not fit the rules of matrix multiplication, where the number of columns in a matrix must match the number of rows in the subsequent matrix.

If you wish to include the zero singular values, you will need to resize the matrices and set up a diagonal matrix as done in the above example

Mathematics of the SVD and implications

Let us take a closer look at the mathematics of the SVD and the various implications for machine learning studies.

Our starting point is our design matrix $ \boldsymbol{X} $ of dimension $ n\times p $

$$ \boldsymbol{X}=\begin{bmatrix} x_{0,0} & x_{0,1} & x_{0,2}& \dots & \dots x_{0,p-1}\\ x_{1,0} & x_{1,1} & x_{1,2}& \dots & \dots x_{1,p-1}\\ x_{2,0} & x_{2,1} & x_{2,2}& \dots & \dots x_{2,p-1}\\ \dots & \dots & \dots & \dots \dots & \dots \\ x_{n-2,0} & x_{n-2,1} & x_{n-2,2}& \dots & \dots x_{n-2,p-1}\\ x_{n-1,0} & x_{n-1,1} & x_{n-1,2}& \dots & \dots x_{n-1,p-1}\\ \end{bmatrix}. $$

We can SVD decompose our matrix as

$$ \boldsymbol{X}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T, $$

where $ \boldsymbol{U} $ is an orthogonal matrix of dimension $ n\times n $, meaning that $ \boldsymbol{U}\boldsymbol{U}^T=\boldsymbol{U}^T\boldsymbol{U}=\boldsymbol{I}_n $. Here $ \boldsymbol{I}_n $ is the unit matrix of dimension $ n \times n $.

Similarly, $ \boldsymbol{V} $ is an orthogonal matrix of dimension $ p\times p $, meaning that $ \boldsymbol{V}\boldsymbol{V}^T=\boldsymbol{V}^T\boldsymbol{V}=\boldsymbol{I}_p $. Here $ \boldsymbol{I}_n $ is the unit matrix of dimension $ p \times p $.

Finally $ \boldsymbol{\Sigma} $ contains the singular values $ \sigma_i $. This matrix has dimension $ n\times p $ and the singular values $ \sigma_i $ are all positive. The non-zero values are ordered in descending order, that is

$$ \sigma_0 > \sigma_1 > \sigma_2 > \dots > \sigma_{p-1} > 0. $$

All values beyond $ p-1 $ are all zero.

Example Matrix

As an example, consider the following $ 3\times 2 $ example for the matrix $ \boldsymbol{\Sigma} $

$$ \boldsymbol{\Sigma}= \begin{bmatrix} 2& 0 \\ 0 & 1 \\ 0 & 0 \\ \end{bmatrix} $$

The singular values are $ \sigma_0=2 $ and $ \sigma_1=1 $. It is common to rewrite the matrix $ \boldsymbol{\Sigma} $ as

$$ \boldsymbol{\Sigma}= \begin{bmatrix} \boldsymbol{\tilde{\Sigma}}\\ \boldsymbol{0}\\ \end{bmatrix}, $$

where

$$ \boldsymbol{\tilde{\Sigma}}= \begin{bmatrix} 2& 0 \\ 0 & 1 \\ \end{bmatrix}, $$

contains only the singular values. Note also (and we will use this below) that

$$ \boldsymbol{\Sigma}^T\boldsymbol{\Sigma}= \begin{bmatrix} 4& 0 \\ 0 & 1 \\ \end{bmatrix}, $$

which is a $ 2\times 2 $ matrix while

$$ \boldsymbol{\Sigma}\boldsymbol{\Sigma}^T= \begin{bmatrix} 4& 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0\\ \end{bmatrix}, $$

is a $ 3\times 3 $ matrix. The last row and column of this last matrix contain only zeros. This will have important consequences for our SVD decomposition of the design matrix.

Setting up the Matrix to be inverted

The matrix that may cause problems for us is $ \boldsymbol{X}^T\boldsymbol{X} $. Using the SVD we can rewrite this matrix as

$$ \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{U}^T\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T, $$

and using the orthogonality of the matrix $ \boldsymbol{U} $ we have

$$ \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{\Sigma}\boldsymbol{V}^T. $$

We define $ \boldsymbol{\Sigma}^T\boldsymbol{\Sigma}=\tilde{\boldsymbol{\Sigma}}^2 $ which is a diagonal matrix containing only the singular values squared. It has dimensionality $ p \times p $.

We can now insert the result for the matrix $ \boldsymbol{X}^T\boldsymbol{X} $ into our equation for ordinary least squares where

$$ \tilde{y}_{\mathrm{OLS}}=\boldsymbol{X}\left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}, $$

and using our SVD decomposition of $ \boldsymbol{X} $ we have

$$ \tilde{y}_{\mathrm{OLS}}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T\left(\boldsymbol{V}\tilde{\boldsymbol{\Sigma}}^{2}(\boldsymbol{V}^T\right)^{-1}\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{U}^T\boldsymbol{y}, $$

which gives us, using the orthogonality of the matrix $ \boldsymbol{V} $,

$$ \tilde{y}_{\mathrm{OLS}}=\sum_{i=0}^{p-1}\boldsymbol{u}_i\boldsymbol{u}^T_i\boldsymbol{y}, $$

which is not the same as $ \tilde{y}_{\mathrm{OLS}}=\boldsymbol{U}\boldsymbol{U}^T\boldsymbol{y} $, which due to the orthogonality of $ \boldsymbol{U} $ would have given us that the model equals the output.

It means that the ordinary least square model (with the optimal parameters) $ \boldsymbol{\tilde{y}} $, corresponds to an orthogonal transformation of the output (or target) vector $ \boldsymbol{y} $ by the vectors of the matrix $ \boldsymbol{U} $. Note that the summation ends at $ p-1 $, that is $ \boldsymbol{\tilde{y}}\ne \boldsymbol{y} $. We can thus not use the orthogonality relation for the matrix $ \boldsymbol{U} $.

Further properties (important for our analyses later)

Let us study again $ \boldsymbol{X}^T\boldsymbol{X} $ in terms of our SVD,

$$ \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{U}^T\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{\Sigma}\boldsymbol{V}^T. $$

If we now multiply from the right with $ \boldsymbol{V} $ (using the orthogonality of $ \boldsymbol{V} $) we get

$$ \left(\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{V}=\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{\Sigma}. $$

This means the vectors $ \boldsymbol{v}_i $ of the orthogonal matrix $ \boldsymbol{V} $ are the eigenvectors of the matrix $ \boldsymbol{X}^T\boldsymbol{X} $ with eigenvalues given by the singular values squared, that is

$$ \left(\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{v}_i=\boldsymbol{v}_i\sigma_i^2. $$

Similarly, if we use the SVD decomposition for the matrix $ \boldsymbol{X}\boldsymbol{X}^T $, we have

$$ \boldsymbol{X}\boldsymbol{X}^T=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T\boldsymbol{V}\boldsymbol{\Sigma}^T\boldsymbol{U}^T=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T\boldsymbol{U}^T. $$

If we now multiply from the right with $ \boldsymbol{U} $ (using the orthogonality of $ \boldsymbol{U} $) we get

$$ \left(\boldsymbol{X}\boldsymbol{X}^T\right)\boldsymbol{U}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T. $$

This means the vectors $ \boldsymbol{u}_i $ of the orthogonal matrix $ \boldsymbol{U} $ are the eigenvectors of the matrix $ \boldsymbol{X}\boldsymbol{X}^T $ with eigenvalues given by the singular values squared, that is

$$ \left(\boldsymbol{X}\boldsymbol{X}^T\right)\boldsymbol{u}_i=\boldsymbol{u}_i\sigma_i^2. $$

Important note: we have defined our design matrix $ \boldsymbol{X} $ to be an $ n\times p $ matrix. In most supervised learning cases we have that $ n \ge p $, and quite often we have $ n >> p $. For linear algebra based methods like ordinary least squares or Ridge regression, this leads to a matrix $ \boldsymbol{X}^T\boldsymbol{X} $ which is small and thereby easier to handle from a computational point of view (in terms of number of floating point operations).

In our lectures, the number of columns will always refer to the number of features in our data set, while the number of rows represents the number of data inputs. Note that in other texts you may find the opposite notation. This has consequences for the definition of for example the covariance matrix and its relation to the SVD.

Back to Ridge and LASSO Regression

or we can state it as

where we have used the definition of a norm-2 vector, that is

$$ \vert\vert \boldsymbol{x}\vert\vert_2 = \sqrt{\sum_i x_i^2}. $$

which leads to the Ridge regression minimization problem where we require that $ \vert\vert \boldsymbol{\theta}\vert\vert_2^2\le t $, where $ t $ is a finite number larger than zero. By defining

$$ C(\boldsymbol{X},\boldsymbol{\theta})=\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\lambda\vert\vert \boldsymbol{\theta}\vert\vert_1, $$

we have a new optimization equation

which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator.

Here we have defined the norm-1 as

$$ \vert\vert \boldsymbol{x}\vert\vert_1 = \sum_i \vert x_i\vert. $$

Ridge regression, as discussed above, is nothing but the standard OLS with a modified diagonal term added to $ \boldsymbol{X}^T\boldsymbol{X} $. The consequences, in particular for our discussion of the bias-variance tradeoff are rather interesting. We will see that for specific values of $ \lambda $, we may even reduce the variance of the optimal parameters $ \boldsymbol{\theta} $. These topics and other related ones, will be discussed after the more linear algebra oriented analysis here.

Using our insights about the SVD of the design matrix $ \boldsymbol{X} $ We have already analyzed the OLS solutions in terms of the eigenvectors (the columns) of the right singular value matrix $ \boldsymbol{U} $ as

$$ \tilde{\boldsymbol{y}}_{\mathrm{OLS}}=\boldsymbol{X}\boldsymbol{\theta} =\boldsymbol{U}\boldsymbol{U}^T\boldsymbol{y}. $$

For Ridge regression this becomes

$$ \tilde{\boldsymbol{y}}_{\mathrm{Ridge}}=\boldsymbol{X}\boldsymbol{\theta}_{\mathrm{Ridge}} = \boldsymbol{U\Sigma V^T}\left(\boldsymbol{V}\boldsymbol{\Sigma}^2\boldsymbol{V}^T+\lambda\boldsymbol{I} \right)^{-1}(\boldsymbol{U\Sigma V^T})^T\boldsymbol{y}=\sum_{j=0}^{p-1}\boldsymbol{u}_j\boldsymbol{u}_j^T\frac{\sigma_j^2}{\sigma_j^2+\lambda}\boldsymbol{y}, $$

with the vectors $ \boldsymbol{u}_j $ being the columns of $ \boldsymbol{U} $ from the SVD of the matrix $ \boldsymbol{X} $.

Interpreting the Ridge results

Since $ \lambda \geq 0 $, it means that compared to OLS, we have

$$ \frac{\sigma_j^2}{\sigma_j^2+\lambda} \leq 1. $$

Ridge regression finds the coordinates of $ \boldsymbol{y} $ with respect to the orthonormal basis $ \boldsymbol{U} $, it then shrinks the coordinates by $ \frac{\sigma_j^2}{\sigma_j^2+\lambda} $. Recall that the SVD has eigenvalues ordered in a descending way, that is $ \sigma_i \geq \sigma_{i+1} $.

For small eigenvalues $ \sigma_i $ it means that their contributions become less important, a fact which can be used to reduce the number of degrees of freedom. More about this when we have covered the material on a statistical interpretation of various linear regression methods.

More interpretations

For the sake of simplicity, let us assume that the design matrix is orthonormal, that is

$$ \boldsymbol{X}^T\boldsymbol{X}=(\boldsymbol{X}^T\boldsymbol{X})^{-1} =\boldsymbol{I}. $$

In this case the standard OLS results in

$$ \boldsymbol{\theta}^{\mathrm{OLS}} = \boldsymbol{X}^T\boldsymbol{y}, $$

and

$$ \boldsymbol{\theta}^{\mathrm{Ridge}} = \left(\boldsymbol{I}+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}=\left(1+\lambda\right)^{-1}\boldsymbol{\theta}^{\mathrm{OLS}}, $$

that is the Ridge estimator scales the OLS estimator by the inverse of a factor $ 1+\lambda $, and the Ridge estimator converges to zero when the hyperparameter goes to infinity.

We will come back to more interpretions after we have gone through some of the statistical analysis part.

For more discussions of Ridge and Lasso regression, Wessel van Wieringen's article is highly recommended. Similarly, Mehta et al's article is also recommended.

Deriving the Lasso Regression Equations

Using the matrix-vector expression for Lasso regression, we have the following cost function

$$ C(\boldsymbol{X},\boldsymbol{\theta})=\frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})\right\}+\lambda\vert\vert\boldsymbol{\theta}\vert\vert_1, $$

Taking the derivative with respect to $ \boldsymbol{\theta} $ and recalling that the derivative of the absolute value is (we drop the boldfaced vector symbol for simplicty)

$$ \frac{d \vert \theta\vert}{d \theta}=\mathrm{sgn}(\theta)=\left\{\begin{array}{cc} 1 & \theta > 0 \\-1 & \theta < 0, \end{array}\right. $$

we have that the derivative of the cost function is

$$ \frac{\partial C(\boldsymbol{X},\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=-\frac{2}{n}\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})+\lambda sgn(\boldsymbol{\theta})=0, $$

and reordering we have

$$ \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}+\frac{n}{2}\lambda sgn(\boldsymbol{\theta})=\boldsymbol{X}^T\boldsymbol{y}. $$

We can redefine $ \lambda $ to absorb the constant $ n/2 $ and we rewrite the last equation as

$$ \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}+\lambda sgn(\boldsymbol{\theta})=\boldsymbol{X}^T\boldsymbol{y}. $$

This equation does not lead to a nice analytical equation as in either Ridge regression or ordinary least squares. This equation can however be solved by using standard convex optimization algorithms.We will discuss how to code the above methods using gradient descent methods.

Optimization and gradient descent, the central part of any Machine Learning algortithm

Almost every problem in machine learning and data science starts with a dataset $ X $, a model $ g(\theta) $, which is a function of the parameters $ \theta $ and a cost function $ C(X, g(\theta)) $ that allows us to judge how well the model $ g(\theta) $ explains the observations $ X $. The model is fit by finding the values of $ \theta $ that minimize the cost function. Ideally we would be able to solve for $ \theta $ analytically, however this is not possible in general and we must use some approximative/numerical method to compute the minimum.

Reminder on Newton-Raphson's method

Let us quickly remind ourselves how we derive the above method.

Perhaps the most celebrated of all one-dimensional root-finding routines is Newton's method, also called the Newton-Raphson method. This method requires the evaluation of both the function $ f $ and its derivative $ f' $ at arbitrary points. If you can only calculate the derivative numerically and/or your function is not of the smooth type, we normally discourage the use of this method.

The equations

The Newton-Raphson formula consists geometrically of extending the tangent line at a current point until it crosses zero, then setting the next guess to the abscissa of that zero-crossing. The mathematics behind this method is rather simple. Employing a Taylor expansion for $ x $ sufficiently close to the solution $ s $, we have

$$ f(s)=0=f(x)+(s-x)f'(x)+\frac{(s-x)^2}{2}f''(x) +\dots. \label{eq:taylornr} $$

For small enough values of the function and for well-behaved functions, the terms beyond linear are unimportant, hence we obtain

$$ f(x)+(s-x)f'(x)\approx 0, $$

yielding

$$ s\approx x-\frac{f(x)}{f'(x)}. $$

Having in mind an iterative procedure, it is natural to start iterating with

$$ x_{n+1}=x_n-\frac{f(x_n)}{f'(x_n)}. $$

Simple geometric interpretation

The above is Newton-Raphson's method. It has a simple geometric interpretation, namely $ x_{n+1} $ is the point where the tangent from $ (x_n,f(x_n)) $ crosses the $ x $-axis. Close to the solution, Newton-Raphson converges fast to the desired result. However, if we are far from a root, where the higher-order terms in the series are important, the Newton-Raphson formula can give grossly inaccurate results. For instance, the initial guess for the root might be so far from the true root as to let the search interval include a local maximum or minimum of the function. If an iteration places a trial guess near such a local extremum, so that the first derivative nearly vanishes, then Newton-Raphson may fail totally

Extending to more than one variable

Newton's method can be generalized to systems of several non-linear equations and variables. Consider the case with two equations

$$ \begin{array}{cc} f_1(x_1,x_2) &=0\\ f_2(x_1,x_2) &=0,\end{array} $$

which we Taylor expand to obtain

$$ \begin{array}{cc} 0=f_1(x_1+h_1,x_2+h_2)=&f_1(x_1,x_2)+h_1 \partial f_1/\partial x_1+h_2 \partial f_1/\partial x_2+\dots\\ 0=f_2(x_1+h_1,x_2+h_2)=&f_2(x_1,x_2)+h_1 \partial f_2/\partial x_1+h_2 \partial f_2/\partial x_2+\dots \end{array}. $$

Defining the Jacobian matrix $ {\bf \boldsymbol{J}} $ we have

$$ {\bf \boldsymbol{J}}=\left( \begin{array}{cc} \partial f_1/\partial x_1 & \partial f_1/\partial x_2 \\ \partial f_2/\partial x_1 &\partial f_2/\partial x_2 \end{array} \right), $$

we can rephrase Newton's method as

$$ \left(\begin{array}{c} x_1^{n+1} \\ x_2^{n+1} \end{array} \right)= \left(\begin{array}{c} x_1^{n} \\ x_2^{n} \end{array} \right)+ \left(\begin{array}{c} h_1^{n} \\ h_2^{n} \end{array} \right), $$

where we have defined

$$ \left(\begin{array}{c} h_1^{n} \\ h_2^{n} \end{array} \right)= -{\bf \boldsymbol{J}}^{-1} \left(\begin{array}{c} f_1(x_1^{n},x_2^{n}) \\ f_2(x_1^{n},x_2^{n}) \end{array} \right). $$

We need thus to compute the inverse of the Jacobian matrix and it is to understand that difficulties may arise in case $ {\bf \boldsymbol{J}} $ is nearly singular.

It is rather straightforward to extend the above scheme to systems of more than two non-linear equations. In our case, the Jacobian matrix is given by the Hessian that represents the second derivative of cost function.

Steepest descent

The basic idea of gradient descent is that a function $ F(\mathbf{x}) $, $ \mathbf{x} \equiv (x_1,\cdots,x_n) $, decreases fastest if one goes from $ \bf {x} $ in the direction of the negative gradient $ -\nabla F(\mathbf{x}) $.

It can be shown that if

$$ \mathbf{x}_{k+1} = \mathbf{x}_k - \gamma_k \nabla F(\mathbf{x}_k), $$

with $ \gamma_k > 0 $.

For $ \gamma_k $ small enough, then $ F(\mathbf{x}_{k+1}) \leq F(\mathbf{x}_k) $. This means that for a sufficiently small $ \gamma_k $ we are always moving towards smaller function values, i.e a minimum.

More on Steepest descent

The previous observation is the basis of the method of steepest descent, which is also referred to as just gradient descent (GD). One starts with an initial guess $ \mathbf{x}_0 $ for a minimum of $ F $ and computes new approximations according to

$$ \mathbf{x}_{k+1} = \mathbf{x}_k - \gamma_k \nabla F(\mathbf{x}_k), \ \ k \geq 0. $$

The parameter $ \gamma_k $ is often referred to as the step length or the learning rate within the context of Machine Learning.

The ideal

Ideally the sequence $ \{\mathbf{x}_k \}_{k=0} $ converges to a global minimum of the function $ F $. In general we do not know if we are in a global or local minimum. In the special case when $ F $ is a convex function, all local minima are also global minima, so in this case gradient descent can converge to the global solution. The advantage of this scheme is that it is conceptually simple and straightforward to implement. However the method in this form has some severe limitations:

In machine learing we are often faced with non-convex high dimensional cost functions with many local minima. Since GD is deterministic we will get stuck in a local minimum, if the method converges, unless we have a very good intial guess. This also implies that the scheme is sensitive to the chosen initial condition.

Note that the gradient is a function of $ \mathbf{x} = (x_1,\cdots,x_n) $ which makes it expensive to compute numerically.

The sensitiveness of the gradient descent

The gradient descent method is sensitive to the choice of learning rate $ \gamma_k $. This is due to the fact that we are only guaranteed that $ F(\mathbf{x}_{k+1}) \leq F(\mathbf{x}_k) $ for sufficiently small $ \gamma_k $. The problem is to determine an optimal learning rate. If the learning rate is chosen too small the method will take a long time to converge and if it is too large we can experience erratic behavior.

Many of these shortcomings can be alleviated by introducing randomness. One such method is that of Stochastic Gradient Descent (SGD), to be discussed next week.

Convex functions

Ideally we want our cost/loss function to be convex(concave).

First we give the definition of a convex set: A set $ C $ in $ \mathbb{R}^n $ is said to be convex if, for all $ x $ and $ y $ in $ C $ and all $ t \in (0,1) $ , the point $ (1 − t)x + ty $ also belongs to C. Geometrically this means that every point on the line segment connecting $ x $ and $ y $ is in $ C $ as discussed below.

The convex subsets of $ \mathbb{R} $ are the intervals of $ \mathbb{R} $. Examples of convex sets of $ \mathbb{R}^2 $ are the regular polygons (triangles, rectangles, pentagons, etc...).

Convex function

Convex function: Let $ X \subset \mathbb{R}^n $ be a convex set. Assume that the function $ f: X \rightarrow \mathbb{R} $ is continuous, then $ f $ is said to be convex if $$f(tx_1 + (1-t)x_2) \leq tf(x_1) + (1-t)f(x_2) $$ for all $ x_1, x_2 \in X $ and for all $ t \in [0,1] $. If $ \leq $ is replaced with a strict inequaltiy in the definition, we demand $ x_1 \neq x_2 $ and $ t\in(0,1) $ then $ f $ is said to be strictly convex. For a single variable function, convexity means that if you draw a straight line connecting $ f(x_1) $ and $ f(x_2) $, the value of the function on the interval $ [x_1,x_2] $ is always below the line as illustrated below.

Conditions on convex functions

In the following we state first and second-order conditions which ensures convexity of a function $ f $. We write $ D_f $ to denote the domain of $ f $, i.e the subset of $ R^n $ where $ f $ is defined. For more details and proofs we refer to: S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press.

First order condition

Suppose $ f $ is differentiable (i.e $ \nabla f(x) $ is well defined for all $ x $ in the domain of $ f $). Then $ f $ is convex if and only if $ D_f $ is a convex set and $$f(y) \geq f(x) + \nabla f(x)^T (y-x) $$ holds for all $ x,y \in D_f $. This condition means that for a convex function the first order Taylor expansion (right hand side above) at any point a global under estimator of the function. To convince yourself you can make a drawing of $ f(x) = x^2+1 $ and draw the tangent line to $ f(x) $ and note that it is always below the graph.

Second order condition

Assume that $ f $ is twice differentiable, i.e the Hessian matrix exists at each point in $ D_f $. Then $ f $ is convex if and only if $ D_f $ is a convex set and its Hessian is positive semi-definite for all $ x\in D_f $. For a single-variable function this reduces to $ f''(x) \geq 0 $. Geometrically this means that $ f $ has nonnegative curvature everywhere.

This condition is particularly useful since it gives us an procedure for determining if the function under consideration is convex, apart from using the definition.

More on convex functions

The next result is of great importance to us and the reason why we are going on about convex functions. In machine learning we frequently have to minimize a loss/cost function in order to find the best parameters for the model we are considering.

Ideally we want the global minimum (for high-dimensional models it is hard to know if we have local or global minimum). However, if the cost/loss function is convex the following result provides invaluable information:

Any minimum is global for convex functions

Consider the problem of finding $ x \in \mathbb{R}^n $ such that $ f(x) $ is minimal, where $ f $ is convex and differentiable. Then, any point $ x^* $ that satisfies $ \nabla f(x^*) = 0 $ is a global minimum.

This result means that if we know that the cost/loss function is convex and we are able to find a minimum, we are guaranteed that it is a global minimum.

Some simple problems

Show that $ f(x)=x^2 $ is convex for $ x \in \mathbb{R} $ using the definition of convexity. Hint: If you re-write the definition, $ f $ is convex if the following holds for all $ x,y \in D_f $ and any $ \lambda \in [0,1] $ $\lambda f(x)+(1-\lambda)f(y)-f(\lambda x + (1-\lambda) y ) \geq 0$.
Using the second order condition show that the following functions are convex on the specified domain.

$ f(x) = e^x $ is convex for $ x \in \mathbb{R} $.
$ g(x) = -\ln(x) $ is convex for $ x \in (0,\infty) $.

Let $ f(x) = x^2 $ and $ g(x) = e^x $. Show that $ f(g(x)) $ and $ g(f(x)) $ is convex for $ x \in \mathbb{R} $. Also show that if $ f(x) $ is any convex function than $ h(x) = e^{f(x)} $ is convex.
A norm is any function that satisfy the following properties

$ f(\alpha x) = |\alpha| f(x) $ for all $ \alpha \in \mathbb{R} $.
$ f(x+y) \leq f(x) + f(y) $
$ f(x) \leq 0 $ for all $ x \in \mathbb{R}^n $ with equality if and only if $ x = 0 $

Using the definition of convexity, try to show that a function satisfying the properties above is convex (the third condition is not needed to show this).

Revisiting Ordinary Least Squares

We will use linear regression as a case study for the gradient descent methods. Linear regression is a great test case for the gradient descent methods discussed in the lectures since it has several desirable properties such as:

An analytical solution (recall homework sets for week 35).
The gradient can be computed analytically.
The cost function is convex which guarantees that gradient descent converges for small enough learning rates

We revisit an example similar to what we had in the first homework set. We had a function of the type

x = 2*np.random.rand(m,1)
y = 4+3*x+np.random.randn(m,1)

with $ x_i \in [0,1] $ is chosen randomly using a uniform distribution. Additionally we have a stochastic noise chosen according to a normal distribution $ \cal {N}(0,1) $. The linear regression model is given by

$$ h_\theta(x) = \boldsymbol{y} = \theta_0 + \theta_1 x, $$

such that

$$ \boldsymbol{y}_i = \theta_0 + \theta_1 x_i. $$

Gradient descent example

Let $ \mathbf{y} = (y_1,\cdots,y_n)^T $, $ \mathbf{\boldsymbol{y}} = (\boldsymbol{y}_1,\cdots,\boldsymbol{y}_n)^T $ and $ \theta = (\theta_0, \theta_1)^T $

It is convenient to write $ \mathbf{\boldsymbol{y}} = X\theta $ where $ X \in \mathbb{R}^{100 \times 2} $ is the design matrix given by (we keep the intercept here)

$$ X \equiv \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{100} & \\ \end{bmatrix}. $$

The cost/loss/risk function is given by (

$$ C(\theta) = \frac{1}{n}||X\theta-\mathbf{y}||_{2}^{2} = \frac{1}{n}\sum_{i=1}^{100}\left[ (\theta_0 + \theta_1 x_i)^2 - 2 y_i (\theta_0 + \theta_1 x_i) + y_i^2\right] $$

and we want to find $ \theta $ such that $ C(\theta) $ is minimized.

The derivative of the cost/loss function

Computing $ \partial C(\theta) / \partial \theta_0 $ and $ \partial C(\theta) / \partial \theta_1 $ we can show that the gradient can be written as

$$ \nabla_{\theta} C(\theta) = \frac{2}{n}\begin{bmatrix} \sum_{i=1}^{100} \left(\theta_0+\theta_1x_i-y_i\right) \\ \sum_{i=1}^{100}\left( x_i (\theta_0+\theta_1x_i)-y_ix_i\right) \\ \end{bmatrix} = \frac{2}{n}X^T(X\theta - \mathbf{y}), $$

where $ X $ is the design matrix defined above.

The Hessian matrix

The Hessian matrix of $ C(\theta) $ is given by

$$ \boldsymbol{H} \equiv \begin{bmatrix} \frac{\partial^2 C(\theta)}{\partial \theta_0^2} & \frac{\partial^2 C(\theta)}{\partial \theta_0 \partial \theta_1} \\ \frac{\partial^2 C(\theta)}{\partial \theta_0 \partial \theta_1} & \frac{\partial^2 C(\theta)}{\partial \theta_1^2} & \\ \end{bmatrix} = \frac{2}{n}X^T X. $$

This result implies that $ C(\theta) $ is a convex function since the matrix $ X^T X $ always is positive semi-definite.

Simple program

We can now write a program that minimizes $ C(\theta) $ using the gradient descent method with a constant learning rate $ \gamma $ according to

$$ \theta_{k+1} = \theta_k - \gamma \nabla_\theta C(\theta_k), \ k=0,1,\cdots $$

We can use the expression we computed for the gradient and let use a $ \theta_0 $ be chosen randomly and let $ \gamma = 0.001 $. Stop iterating when $ ||\nabla_\theta C(\theta_k) || \leq \epsilon = 10^{-8} $. Note that the code below does not include the latter stop criterion.

And finally we can compare our solution for $ \theta $ with the analytic result given by $ \theta= (X^TX)^{-1} X^T \mathbf{y} $.

Gradient Descent Example

Here our simple example

# Importing various packages
from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import sys

# the number of datapoints
n = 100
x = 2*np.random.rand(n,1)
y = 4+3*x+np.random.randn(n,1)

X = np.c_[np.ones((n,1)), x]
# Hessian matrix
H = (2.0/n)* X.T @ X
# Get the eigenvalues
EigValues, EigVectors = np.linalg.eig(H)
print(f"Eigenvalues of Hessian Matrix:{EigValues}")

theta_linreg = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta_linreg)
theta = np.random.randn(2,1)

eta = 1.0/np.max(EigValues)
Niterations = 1000

for iter in range(Niterations):
    gradient = (2.0/n)*X.T @ (X @ theta-y)
    theta -= eta*gradient

print(theta)
xnew = np.array([[0],[2]])
xbnew = np.c_[np.ones((2,1)), xnew]
ypredict = xbnew.dot(theta)
ypredict2 = xbnew.dot(theta_linreg)
plt.plot(xnew, ypredict, "r-")
plt.plot(xnew, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example')
plt.show()

Gradient descent and Ridge

We have also discussed Ridge regression where the loss function contains a regularized term given by the $ L_2 $ norm of $ \theta $,

$$ C_{\text{ridge}}(\theta) = \frac{1}{n}||X\theta -\mathbf{y}||^2 + \lambda ||\theta||^2, \ \lambda \geq 0. $$

In order to minimize $ C_{\text{ridge}}(\theta) $ using GD we adjust the gradient as follows

$$ \nabla_\theta C_{\text{ridge}}(\theta) = \frac{2}{n}\begin{bmatrix} \sum_{i=1}^{100} \left(\theta_0+\theta_1x_i-y_i\right) \\ \sum_{i=1}^{100}\left( x_i (\theta_0+\theta_1x_i)-y_ix_i\right) \\ \end{bmatrix} + 2\lambda\begin{bmatrix} \theta_0 \\ \theta_1\end{bmatrix} = 2 (\frac{1}{n}X^T(X\theta - \mathbf{y})+\lambda \theta). $$

We can easily extend our program to minimize $ C_{\text{ridge}}(\theta) $ using gradient descent and compare with the analytical solution given by

$$ \theta_{\text{ridge}} = \left(X^T X + n\lambda I_{2 \times 2} \right)^{-1} X^T \mathbf{y}. $$

The Hessian matrix for Ridge Regression

The Hessian matrix of Ridge Regression for our simple example is given by

This implies that the Hessian matrix is positive definite, hence the stationary point is a minimum. Note that the Ridge cost function is convex being a sum of two convex functions. Therefore, the stationary point is a global minimum of this function.

Program example for gradient descent with Ridge Regression

from random import random, seed
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import sys

# the number of datapoints
n = 100
x = 2*np.random.rand(n,1)
y = 4+3*x+np.random.randn(n,1)

X = np.c_[np.ones((n,1)), x]
XT_X = X.T @ X

#Ridge parameter lambda
lmbda  = 0.001
Id = n*lmbda* np.eye(XT_X.shape[0])

# Hessian matrix
H = (2.0/n)* XT_X+2*lmbda* np.eye(XT_X.shape[0])
# Get the eigenvalues
EigValues, EigVectors = np.linalg.eig(H)
print(f"Eigenvalues of Hessian Matrix:{EigValues}")


theta_linreg = np.linalg.inv(XT_X+Id) @ X.T @ y
print(theta_linreg)
# Start plain gradient descent
theta = np.random.randn(2,1)

eta = 1.0/np.max(EigValues)
Niterations = 100

for iter in range(Niterations):
    gradients = 2.0/n*X.T @ (X @ (theta)-y)+2*lmbda*theta
    theta -= eta*gradients

print(theta)
ypredict = X @ theta
ypredict2 = X @ theta_linreg
plt.plot(x, ypredict, "r-")
plt.plot(x, ypredict2, "b-")
plt.plot(x, y ,'ro')
plt.axis([0,2.0,0, 15.0])
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title(r'Gradient descent example for Ridge')
plt.show()

Using gradient descent methods, limitations

Gradient descent (GD) finds local minima of our function. Since the GD algorithm is deterministic, if it converges, it will converge to a local minimum of our cost/loss/risk function. Because in ML we are often dealing with extremely rugged landscapes with many local minima, this can lead to poor performance.
GD is sensitive to initial conditions. One consequence of the local nature of GD is that initial conditions matter. Depending on where one starts, one will end up at a different local minima. Therefore, it is very important to think about how one initializes the training process. This is true for GD as well as more complicated variants of GD.
Gradients are computationally expensive to calculate for large datasets. In many cases in statistics and ML, the cost/loss/risk function is a sum of terms, with one term for each data point. For example, in linear regression, $ E \propto \sum_{i=1}^n (y_i - \mathbf{w}^T\cdot\mathbf{x}_i)^2 $; for logistic regression, the square error is replaced by the cross entropy. To calculate the gradient we have to sum over all $ n $ data points. Doing this at every GD step becomes extremely computationally expensive. An ingenious solution to this, is to calculate the gradients using small subsets of the data called "mini batches". This has the added benefit of introducing stochasticity into our algorithm.
GD is very sensitive to choices of learning rates. GD is extremely sensitive to the choice of learning rates. If the learning rate is very small, the training process take an extremely long time. For larger learning rates, GD can diverge and give poor results. Furthermore, depending on what the local landscape looks like, we have to modify the learning rates to ensure convergence. Ideally, we would adaptively choose the learning rates to match the landscape.
GD treats all directions in parameter space uniformly. Another major drawback of GD is that unlike Newton's method, the learning rate for GD is the same in all directions in parameter space. For this reason, the maximum learning rate is set by the behavior of the steepest direction and this can significantly slow down training. Ideally, we would like to take large steps in flat directions and small steps in steep directions. Since we are exploring rugged landscapes where curvatures change, this requires us to keep track of not only the gradient but second derivatives. The ideal scenario would be to calculate the Hessian but this proves to be too computationally expensive.
GD can take exponential time to escape saddle points, even with random initialization. As we mentioned, GD is extremely sensitive to initial condition since it determines the particular local minimum GD would eventually reach. However, even with a good initialization scheme, through the introduction of randomness, GD can still take exponential time to escape saddle points.

Material for lab sessions sessions Tuesday and Wednesday

The material here contains a summary of the lecture on Monday and discussion of SVD, Ridge and Lasso regression with examples

Linear Regression and the SVD

We used the SVD to analyse the matrix to invert in ordinary lineat regression

Since the matrices here have dimension $ p\times p $, with $ p $ corresponding to the singular values, we defined last week the matrix

$$ \boldsymbol{\Sigma}^T\boldsymbol{\Sigma} = \begin{bmatrix} \tilde{\boldsymbol{\Sigma}} & \boldsymbol{0}\\ \end{bmatrix}\begin{bmatrix} \tilde{\boldsymbol{\Sigma}} \\ \boldsymbol{0}\end{bmatrix}, $$

where the tilde-matrix $ \tilde{\boldsymbol{\Sigma}} $ is a matrix of dimension $ p\times p $ containing only the singular values $ \sigma_i $, that is

$$ \tilde{\boldsymbol{\Sigma}}=\begin{bmatrix} \sigma_0 & 0 & 0 & \dots & 0 & 0 \\ 0 & \sigma_1 & 0 & \dots & 0 & 0 \\ 0 & 0 & \sigma_2 & \dots & 0 & 0 \\ 0 & 0 & 0 & \dots & \sigma_{p-2} & 0 \\ 0 & 0 & 0 & \dots & 0 & \sigma_{p-1} \\ \end{bmatrix}, $$

meaning we can write

$$ \boldsymbol{X}^T\boldsymbol{X}=\boldsymbol{V}\tilde{\boldsymbol{\Sigma}}^2\boldsymbol{V}^T. $$

Multiplying from the right with $ \boldsymbol{V} $ (using the orthogonality of $ \boldsymbol{V} $) we get

$$ \left(\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{V}=\boldsymbol{V}\tilde{\boldsymbol{\Sigma}}^2. $$

What does it mean?

$$ \left(\boldsymbol{X}^T\boldsymbol{X}\right)\boldsymbol{v}_i=\boldsymbol{v}_i\sigma_i^2. $$

In other words, each non-zero singular value of $ \boldsymbol{X} $ is a positive square root of an eigenvalue of $ \boldsymbol{X}^T\boldsymbol{X} $. It means also that the columns of $ \boldsymbol{V} $ are the eigenvectors of $ \boldsymbol{X}^T\boldsymbol{X} $. Since we have ordered the singular values of $ \boldsymbol{X} $ in a descending order, it means that the column vectors $ \boldsymbol{v}_i $ are hierarchically ordered by how much correlation they encode from the columns of $ \boldsymbol{X} $.

Note that these are also the eigenvectors and eigenvalues of the Hessian matrix.

Ridge and LASSO Regression

or we can state it as

where we have used the definition of a norm-2 vector, that is

$$ \vert\vert \boldsymbol{x}\vert\vert_2 = \sqrt{\sum_i x_i^2}. $$

From OLS to Ridge and Lasso

which leads to the Ridge regression minimization problem where we require that $ \vert\vert \boldsymbol{\theta}\vert\vert_2^2\le t $, where $ t $ is a finite number larger than zero. We do not include such a constraints in the discussions here.

By defining

$$ C(\boldsymbol{X},\boldsymbol{\theta})=\frac{1}{n}\vert\vert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\lambda\vert\vert \boldsymbol{\theta}\vert\vert_1, $$

we have a new optimization equation

which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator.

Here we have defined the norm-1 as

$$ \vert\vert \boldsymbol{x}\vert\vert_1 = \sum_i \vert x_i\vert. $$

Deriving the Ridge Regression Equations

Using the matrix-vector expression for Ridge regression and dropping the parameter $ 1/n $ in front of the standard means squared error equation, we have

$$ \hat{\boldsymbol{\theta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X}+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}, $$

with $ \boldsymbol{I} $ being a $ p\times p $ identity matrix with the constraint that

$$ \sum_{i=0}^{p-1} \theta_i^2 \leq t, $$

with $ t $ a finite positive number.

Note on Scikit-Learn

Note well that a library like Scikit-Learn does not include the $ 1/n $ factor in the expression for the mean-squared error. If you include it, the optimal parameter $ \theta $ becomes

$$ \hat{\boldsymbol{\theta}}_{\mathrm{Ridge}} = \left(\boldsymbol{X}^T\boldsymbol{X}+n\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

In our codes where we compare our own codes with Scikit-Learn, we do thus not include the $ 1/n $ factor in the cost function.

Comparison with OLS

When we compare this with the ordinary least squares result we have

$$ \hat{\boldsymbol{\theta}}_{\mathrm{OLS}} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}, $$

which can lead to singular matrices. However, with the SVD, we can always compute the inverse of the matrix $ \boldsymbol{X}^T\boldsymbol{X} $.

SVD analysis

$$ \tilde{\boldsymbol{y}}_{\mathrm{OLS}}=\boldsymbol{X}\boldsymbol{\theta} =\boldsymbol{U}\boldsymbol{U}^T\boldsymbol{y}. $$

For Ridge regression this becomes

with the vectors $ \boldsymbol{u}_j $ being the columns of $ \boldsymbol{U} $ from the SVD of the matrix $ \boldsymbol{X} $.

Interpreting the Ridge results

Since $ \lambda \geq 0 $, it means that compared to OLS, we have

$$ \frac{\sigma_j^2}{\sigma_j^2+\lambda} \leq 1. $$

For small eigenvalues $ \sigma_i $ it means that their contributions become less important, a fact which can be used to reduce the number of degrees of freedom.

More interpretations

For the sake of simplicity, let us assume that the design matrix is orthonormal, that is

$$ \boldsymbol{X}^T\boldsymbol{X}=(\boldsymbol{X}^T\boldsymbol{X})^{-1} =\boldsymbol{I}. $$

In this case the standard OLS results in

$$ \boldsymbol{\theta}^{\mathrm{OLS}} = \boldsymbol{X}^T\boldsymbol{y}=\sum_{i=0}^{n-1}\boldsymbol{u}_i\boldsymbol{u}_i^T\boldsymbol{y}, $$

and

$$ \boldsymbol{\theta}^{\mathrm{Ridge}} = \left(\boldsymbol{I}+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}=\left(1+\lambda\right)^{-1}\boldsymbol{\theta}^{\mathrm{OLS}}, $$

that is the Ridge estimator scales the OLS estimator by the inverse of a factor $ 1+\lambda $, and the Ridge estimator converges to zero when the hyperparameter goes to infinity.

We will come back to more interpreations after we have gone through some of the statistical analysis part.

For more discussions of Ridge and Lasso regression, Wessel van Wieringen's article is highly recommended. Similarly, Mehta et al's article is also recommended.

Deriving the Lasso Regression Equations

Using the matrix-vector expression for Lasso regression, we have the following cost function

Taking the derivative with respect to $ \boldsymbol{\theta} $ and recalling that the derivative of the absolute value is (we drop the boldfaced vector symbol for simplicity)

$$ \frac{d \vert \theta\vert}{d \theta}=\mathrm{sgn}(\theta)=\left\{\begin{array}{cc} 1 & \theta > 0 \\-1 & \theta < 0, \end{array}\right. $$

we have that the derivative of the cost function is

and reordering we have

$$ \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}+\lambda sgn(\boldsymbol{\theta})=\boldsymbol{X}^T\boldsymbol{y}. $$

This equation does not lead to a nice analytical equation as in Ridge regression or ordinary least squares. We have absorbed the factor $ 2/n $ in a redefinition of the parameter $ \lambda $. We will solve this type of problems using libraries like scikit-learn and using our own gradient descent code in project 1.

Simple example to illustrate Ordinary Least Squares, Ridge and Lasso Regression

Let us assume that our design matrix is given by unit (identity) matrix, that is a square diagonal matrix with ones only along the diagonal. In this case we have an equal number of rows and columns $ n=p $.

Our model approximation is just $ \tilde{\boldsymbol{y}}=\boldsymbol{\theta} $ and the mean squared error and thereby the cost function for ordinary least sqquares (OLS) is then (we drop the term $ 1/n $)

$$ C(\boldsymbol{\theta})=\sum_{i=0}^{p-1}(y_i-\theta_i)^2, $$

and minimizing we have that

$$ \hat{\theta}_i^{\mathrm{OLS}} = y_i. $$

Ridge Regression

For Ridge regression our cost function is

$$ C(\boldsymbol{\theta})=\sum_{i=0}^{p-1}(y_i-\theta_i)^2+\lambda\sum_{i=0}^{p-1}\theta_i^2, $$

and minimizing we have that

$$ \hat{\theta}_i^{\mathrm{Ridge}} = \frac{y_i}{1+\lambda}. $$

Lasso Regression

For Lasso regression our cost function is

$$ C(\boldsymbol{\theta})=\sum_{i=0}^{p-1}(y_i-\theta_i)^2+\lambda\sum_{i=0}^{p-1}\vert\theta_i\vert=\sum_{i=0}^{p-1}(y_i-\theta_i)^2+\lambda\sum_{i=0}^{p-1}\sqrt{\theta_i^2}, $$

and minimizing we have that

$$ -2\sum_{i=0}^{p-1}(y_i-\theta_i)+\lambda \sum_{i=0}^{p-1}\frac{(\theta_i)}{\vert\theta_i\vert}=0, $$

which leads to

$$ \hat{\boldsymbol{\theta}}_i^{\mathrm{Lasso}} = \left\{\begin{array}{ccc}y_i-\frac{\lambda}{2} &\mathrm{if} & y_i> \frac{\lambda}{2}\\ y_i+\frac{\lambda}{2} &\mathrm{if} & y_i < -\frac{\lambda}{2}\\ 0 &\mathrm{if} & \vert y_i\vert\le \frac{\lambda}{2}\end{array}\right.\\. $$

Plotting these results shows clearly that Lasso regression suppresses (sets to zero) values of $ \theta_i $ for specific values of $ \lambda $. Ridge regression reduces on the other hand the values of $ \theta_i $ as function of $ \lambda $.

Yet another Example

Let us assume we have a data set with outputs/targets given by the vector

$$ \boldsymbol{y}=\begin{bmatrix}4 \\ 2 \\3\end{bmatrix}, $$

and our inputs as a $ 3\times 2 $ design matrix

$$ \boldsymbol{X}=\begin{bmatrix}2 & 0\\ 0 & 1 \\ 0 & 0\end{bmatrix}, $$

meaning that we have two features and two unknown parameters $ \theta_0 $ and $ \theta_1 $ to be determined either by ordinary least squares, Ridge or Lasso regression.

The OLS case

For ordinary least squares (OLS) we know that the optimal solution is

$$ \hat{\boldsymbol{\theta}}^{\mathrm{OLS}}=\left( \boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

Inserting the above values we obtain that

$$ \hat{\boldsymbol{\theta}}^{\mathrm{OLS}}=\begin{bmatrix}2 \\ 2\end{bmatrix}, $$

The code which implements this simpler case is presented after the discussion of Ridge and Lasso.

The Ridge case

For Ridge regression we have

$$ \hat{\boldsymbol{\theta}}^{\mathrm{Ridge}}=\left( \boldsymbol{X}^T\boldsymbol{X}+\lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}. $$

Inserting the above values we obtain that

$$ \hat{\boldsymbol{\theta}}^{\mathrm{Ridge}}=\begin{bmatrix}\frac{8}{4+\lambda} \\ \frac{2}{1+\lambda}\end{bmatrix}, $$

There is normally a constraint on the value of $ \vert\vert \boldsymbol{\theta}\vert\vert_2 $ via the parameter $ \lambda $. Let us for simplicity assume that $ \theta_0^2+\theta_1^2=1 $ as constraint. This will allow us to find an expression for the optimal values of $ \theta $ and $ \lambda $.

To see this, let us write the cost function for Ridge regression.

Writing the Cost Function

We define the MSE without the $ 1/n $ factor and have then, using that

$$ \boldsymbol{X}\boldsymbol{\theta}=\begin{bmatrix} 2\theta_0 \\ \theta_1 \\0 \end{bmatrix}, $$ $$ C(\boldsymbol{\theta})=(4-2\theta_0)^2+(2-\theta_1)^2+\lambda(\theta_0^2+\theta_1^2), $$

and taking the derivative with respect to $ \theta_0 $ we get

$$ \theta_0=\frac{8}{4+\lambda}, $$

and for $ \theta_1 $ we obtain

$$ \theta_1=\frac{2}{1+\lambda}, $$

Using the constraint for $ \theta_0^2+\theta_1^2=1 $ we can constrain $ \lambda $ by solving

$$ \left(\frac{8}{4+\lambda}\right)^2+\left(\frac{2}{1+\lambda}\right)^2=1, $$

which gives $ \lambda=4.571 $ and $ \theta_0=0.933 $ and $ \theta_1=0.359 $.

Lasso case

For Lasso we need now, keeping a constraint on $ \vert\theta_0\vert+\vert\theta_1\vert=1 $, to take the derivative of the absolute values of $ \theta_0 $ and $ \theta_1 $. This gives us the following derivatives of the cost function

$$ C(\boldsymbol{\theta})=(4-2\theta_0)^2+(2-\theta_1)^2+\lambda(\vert\theta_0\vert+\vert\theta_1\vert), $$ $$ \frac{\partial C(\boldsymbol{\theta})}{\partial \theta_0}=-4(4-2\theta_0)+\lambda\mathrm{sgn}(\theta_0)=0, $$

and

$$ \frac{\partial C(\boldsymbol{\theta})}{\partial \theta_1}=-2(2-\theta_1)+\lambda\mathrm{sgn}(\theta_1)=0. $$

We have now four cases to solve besides the trivial cases $ \theta_0 $ and/or $ \theta_1 $ are zero, namely

$ \theta_0 > 0 $ and $ \theta_1 > 0 $,
$ \theta_0 > 0 $ and $ \theta_1 < 0 $,
$ \theta_0 < 0 $ and $ \theta_1 > 0 $,
$ \theta_0 < 0 $ and $ \theta_1 < 0 $.

The first Case

If we consider the first case, we have then

$$ -4(4-2\theta_0)+\lambda=0, $$

and

$$ -2(2-\theta_1)+\lambda=0. $$

which yields

$$ \theta_0=\frac{16+\lambda}{8}, $$

and

$$ \theta_1=\frac{4+\lambda}{2}. $$

Using the constraint on $ \theta_0 $ and $ \theta_1 $ we can then find the optimal value of $ \lambda $ for the different cases. We leave this as an exercise to you.

Simple code for solving the above problem

Here we set up the OLS, Ridge and Lasso functionality in order to study the above example. Note that here we have opted for a set of values of $ \lambda $, meaning that we need to perform a search in order to find the optimal values.

First we study and compare the OLS and Ridge results. The next code compares all three methods.

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n


# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.

X = np.array( [ [ 2, 0], [0, 1], [0,0]])
y = np.array( [4, 2, 3])


# matrix inversion to find beta
OLSbeta = np.linalg.inv(X.T @ X) @ X.T @ y
print(OLSbeta)
# and then make the prediction
ytildeOLS = X @ OLSbeta
print("Training MSE for OLS")
print(MSE(y,ytildeOLS))
ypredictOLS = X @ OLSbeta

# Repeat now for Ridge regression and various values of the regularization parameter
I = np.eye(2,2)
# Decide which values of lambda to use
nlambdas = 100
MSEPredict = np.zeros(nlambdas)
lambdas = np.logspace(-4, 4, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    Ridgebeta = np.linalg.inv(X.T @ X+lmb*I) @ X.T @ y
#    print(Ridgebeta)
    # and then make the prediction
    ypredictRidge = X @ Ridgebeta
    MSEPredict[i] = MSE(y,ypredictRidge)
#    print(MSEPredict[i])
    # Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSEPredict, 'r--', label = 'MSE Ridge Train')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()

We see here that we reach a plateau. What is actually happening?

With Lasso Regression

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n


# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.

X = np.array( [ [ 2, 0], [0, 1], [0,0]])
y = np.array( [4, 2, 3])


# matrix inversion to find beta
OLSbeta = np.linalg.inv(X.T @ X) @ X.T @ y
print(OLSbeta)
# and then make the prediction
ytildeOLS = X @ OLSbeta
print("Training MSE for OLS")
print(MSE(y,ytildeOLS))
ypredictOLS = X @ OLSbeta

# Repeat now for Ridge regression and various values of the regularization parameter
I = np.eye(2,2)
# Decide which values of lambda to use
nlambdas = 100
MSERidgePredict = np.zeros(nlambdas)
MSELassoPredict = np.zeros(nlambdas)
lambdas = np.logspace(-4, 4, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    Ridgebeta = np.linalg.inv(X.T @ X+lmb*I) @ X.T @ y
    print(Ridgebeta)
    # and then make the prediction
    ypredictRidge = X @ Ridgebeta
    MSERidgePredict[i] = MSE(y,ypredictRidge)
    RegLasso = linear_model.Lasso(lmb,fit_intercept=False)
    RegLasso.fit(X,y)
    ypredictLasso = RegLasso.predict(X)
    print(RegLasso.coef_)
    MSELassoPredict[i] = MSE(y,ypredictLasso)
# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSERidgePredict, 'r--', label = 'MSE Ridge Train')
plt.plot(np.log10(lambdas), MSELassoPredict, 'r--', label = 'MSE Lasso Train')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()

Another Example, now with a polynomial fit

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model

def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n


# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(3155)

x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)

# number of features p (here degree of polynomial
p = 3
#  The design matrix now as function of a given polynomial
X = np.zeros((len(x),p))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x*x
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# matrix inversion to find beta
OLSbeta = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
print(OLSbeta)
# and then make the prediction
ytildeOLS = X_train @ OLSbeta
print("Training MSE for OLS")
print(MSE(y_train,ytildeOLS))
ypredictOLS = X_test @ OLSbeta
print("Test MSE OLS")
print(MSE(y_test,ypredictOLS))

# Repeat now for Lasso and Ridge regression and various values of the regularization parameter
I = np.eye(p,p)
# Decide which values of lambda to use
nlambdas = 100
MSEPredict = np.zeros(nlambdas)
MSETrain = np.zeros(nlambdas)
MSELassoPredict = np.zeros(nlambdas)
MSELassoTrain = np.zeros(nlambdas)
lambdas = np.logspace(-4, 4, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    Ridgebeta = np.linalg.inv(X_train.T @ X_train+lmb*I) @ X_train.T @ y_train
    # include lasso using Scikit-Learn
    RegLasso = linear_model.Lasso(lmb,fit_intercept=False)
    RegLasso.fit(X_train,y_train)
    # and then make the prediction
    ytildeRidge = X_train @ Ridgebeta
    ypredictRidge = X_test @ Ridgebeta
    ytildeLasso = RegLasso.predict(X_train)
    ypredictLasso = RegLasso.predict(X_test)
    MSEPredict[i] = MSE(y_test,ypredictRidge)
    MSETrain[i] = MSE(y_train,ytildeRidge)
    MSELassoPredict[i] = MSE(y_test,ypredictLasso)
    MSELassoTrain[i] = MSE(y_train,ytildeLasso)

# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSETrain, label = 'MSE Ridge train')
plt.plot(np.log10(lambdas), MSEPredict, 'r--', label = 'MSE Ridge Test')
plt.plot(np.log10(lambdas), MSELassoTrain, label = 'MSE Lasso train')
plt.plot(np.log10(lambdas), MSELassoPredict, 'r--', label = 'MSE Lasso Test')

plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()