4. Ridge and Lasso Regression#
4.1. Mathematical Interpretation of Ordinary Least Squares#
What is presented here is a mathematical analysis of various regression algorithms (ordinary least squares, Ridge and Lasso Regression). The analysis is based on an important algorithm in linear algebra, the so-called Singular Value Decomposition (SVD).
We have shown that in ordinary least squares (OLS) the optimal parameters
The hat over
This means that our best model is defined as
We now define a matrix
We can rewrite
The matrix
We have defined the residual error as
The residual errors are then the projections of
If the matrix
In this case the matrix
and we have the obvious case
This serves also as a useful test of our codes.
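As a minimal sketch of such a test (an assumed toy example, not taken from the text above), one can check numerically that the OLS prediction reproduces the data exactly when the design matrix is square and invertible:

```python
import numpy as np

# Assumed toy example: for a square, invertible design matrix X, the OLS
# prediction X @ betahat reproduces the data y exactly (up to round-off).
np.random.seed(2021)
X = np.random.rand(4, 4)    # square and (generically) invertible
y = np.random.rand(4)
betahat = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.allclose(X @ betahat, y))   # True
```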
4.2. The singular value decomposition#
The examples we have looked at so far are cases where we normally can
invert the matrix
As we will also see in the first project, this may however not be the case in general, and a standard matrix inversion algorithm based on say LU, QR or Cholesky decomposition may lead to singularities. We will see examples of this below.
There is however a way to circumvent this problem and also gain some insights about the ordinary least squares approach, and later shrinkage methods like Ridge and Lasso regressions.
This is given by the Singular Value Decomposition (SVD), perhaps the most powerful algorithm in linear algebra. The SVD provides a numerically stable matrix decomposition that is used in a large swath of applications.
In machine learning it plays a central role in dealing with for example design matrices that may be near singular or singular. Furthermore, as we will see here, the singular values can be related to the covariance matrix (and thereby the correlation matrix) and in turn the variance of a given quantity. It plays also an important role in the principal component analysis where high-dimensional data can be reduced to the statistically relevant features.
One of the typical problems we encounter with linear regression, in particular
when the matrix
This means that the matrix may be rank deficient and it is basically impossible to
model the data using linear regression. As an example, consider the matrix
The columns of
Super-collinearity of an
We see easily that
If our design matrix
has linearly dependent column vectors, we will not be able to compute the inverse
of
A cheap ad hoc approach is simply to add a small diagonal component to the matrix to invert, that is we change
where
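A minimal sketch of this ad hoc fix, with an assumed small value for the added diagonal term (this is in fact the matrix that appears in Ridge regression):

```python
import numpy as np

# Assumed example: X has linearly dependent columns, so X^T X is singular.
# Adding a small diagonal term lambda*I makes the matrix invertible
# (this is exactly the matrix that appears in Ridge regression).
X = np.array([[1.0, -1.0], [1.0, -1.0]])
lam = 1e-4                      # small, arbitrary value
H = X.T @ X
# np.linalg.inv(H) would raise a LinAlgError since H is singular
Hinv = np.linalg.inv(H + lam * np.eye(2))
print(Hinv)
```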
4.3. Basic math of the SVD#
From standard linear algebra we know that a square matrix
and the eigenvalues are given by the diagonal matrix
The matrix
with
Not all square matrices are diagonalizable. A matrix like the one discussed above
is not diagonalizable; it is a so-called defective matrix. It is easy to see that the condition
However, and this is the strength of the SVD algorithm, any general
matrix
As an example, the above defective matrix can be decomposed as
with eigenvalues
The SVD
decomposition (singular values) gives eigenvalues
In the general case, where our design matrix
The columns of
If we assume that
The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values,
If
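The following is a small illustration (assumed toy matrix) of the difference between the full and the economy-size SVD as returned by NumPy's np.linalg.svd:

```python
import numpy as np

# Assumed toy matrix: full versus economy-size ("thin") SVD in NumPy.
np.random.seed(0)
X = np.random.rand(5, 2)
U_full, s, VT = np.linalg.svd(X, full_matrices=True)              # U_full is 5 x 5
U_thin, s_thin, VT_thin = np.linalg.svd(X, full_matrices=False)   # U_thin is 5 x 2
print(U_full.shape, U_thin.shape)                  # (5, 5) (5, 2)
print(np.allclose((U_thin * s_thin) @ VT_thin, X)) # the thin SVD still reconstructs X
```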
4.4. Codes for the SVD#
import numpy as np
# SVD inversion
def SVD(A):
    ''' Takes as input a numpy matrix A and returns the reconstruction U D V^T
    from its singular value decomposition (SVD). Note that this function does
    not invert A; it serves as a check of the decomposition itself.
    '''
    U, S, VT = np.linalg.svd(A, full_matrices=True)
    # Sanity check: for orthogonal U and VT these differences are zero matrices
    print('test U')
    print( (np.transpose(U) @ U - U @ np.transpose(U)))
    print('test VT')
    print( (np.transpose(VT) @ VT - VT @ np.transpose(VT)))
    print(U)
    print(S)
    print(VT)
    # Build the rectangular diagonal matrix D from the singular values
    D = np.zeros((len(U), len(VT)))
    for i in range(0, len(VT)):
        D[i, i] = S[i]
    return U @ D @ VT
X = np.array([ [1.0,-1.0], [1.0,-1.0]])
#X = np.array([[1, 2], [3, 4], [5, 6]])
print(X)
C = SVD(X)
# Print the difference between the original matrix and the SVD one
print(C-X)
[[ 1. -1.]
[ 1. -1.]]
test U
[[0. 0.]
[0. 0.]]
test VT
[[0. 0.]
[0. 0.]]
[[-0.70710678 -0.70710678]
[-0.70710678 0.70710678]]
[2.00000000e+00 3.35470445e-17]
[[-0.70710678 0.70710678]
[ 0.70710678 0.70710678]]
[[-3.33066907e-16 4.44089210e-16]
[ 0.00000000e+00 2.22044605e-16]]
The matrix
The
As you can see from the code, the
If you wish to include the zero singular values, you will need to resize the matrices and set up a diagonal matrix as done in the above example
4.5. Code for SVD and Inversion of Matrices#
How do we use the SVD to invert a matrix
#Ainv = np.linalg.pinv(A)
Let us first look at a matrix which does not cause problems and write our own function where we just use the SVD.
import numpy as np
# SVD inversion
def SVDinv(A):
    ''' Takes as input a numpy matrix A and returns inv(A) based on singular value decomposition (SVD).
    SVD is numerically more stable than the inversion algorithms provided by
    numpy and scipy.linalg at the cost of being slower.
    '''
    U, s, VT = np.linalg.svd(A)
    print('test U')
    print( (np.transpose(U) @ U - U @ np.transpose(U)))
    print('test VT')
    print( (np.transpose(VT) @ VT - VT @ np.transpose(VT)))
    D = np.diag(s)
    UT = np.transpose(U); V = np.transpose(VT); invD = np.linalg.inv(D)
    return np.matmul(V, np.matmul(invD, UT))
# Non-singular square matrix
X = np.array( [ [1,2,3],[2,4,5],[3,5,6]])
print(X)
A = np.transpose(X) @ X
# Brute force inversion
B = np.linalg.pinv(A) # here we could use np.linalg.inv(A), try it!
C = SVDinv(A)
print(np.abs(B-C))
[[1 2 3]
[2 4 5]
[3 5 6]]
test U
[[ 4.44089210e-16 -4.69484813e-16 -6.67314874e-16]
[-4.69484813e-16 -4.44089210e-16 -1.54041041e-16]
[-6.67314874e-16 -1.54041041e-16 1.11022302e-16]]
test VT
[[ 2.22044605e-16 3.78156479e-17 1.85278920e-16]
[ 3.78156479e-17 0.00000000e+00 -6.33166055e-17]
[ 1.85278920e-16 -6.33166055e-17 -1.11022302e-16]]
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
Although our matrix to invert
The pseudoinverse generalizes the matrix inverse from square matrices to rectangular matrices, where the numbers of rows and columns differ.
It is also called the Moore-Penrose inverse, after two independent discoverers of the method, or the generalized inverse. It is used to compute an inverse-like matrix for singular or near-singular matrices and for rectangular matrices.
Using the SVD we can obtain the pseudoinverse (PI) of a matrix
where
import numpy as np
# SVD inversion
def SVDinv(A):
    U, s, VT = np.linalg.svd(A)
    # reciprocals of singular values of s
    d = 1.0 / s
    # create m x n D matrix
    D = np.zeros(A.shape)
    # populate D with n x n diagonal matrix
    D[:A.shape[1], :A.shape[1]] = np.diag(d)
    UT = np.transpose(U)
    V = np.transpose(VT)
    return np.matmul(V, np.matmul(D.T, UT))
A = np.array([ [0.3, 0.4], [0.5, 0.6], [0.7, 0.8],[0.9, 1.0]])
print(A)
# Brute force inversion of super-collinear matrix
B = np.linalg.pinv(A)
print(B)
# Compare our own algorithm with pinv
C = SVDinv(A)
print(np.abs(C-B))
[[0.3 0.4]
[0.5 0.6]
[0.7 0.8]
[0.9 1. ]]
[[-13. -6. 1. 8. ]
[ 11.5 5.5 -0.5 -6.5]]
[[0. 0. 0. 0.]
[0. 0. 0. 0.]]
As you can see from these examples, our own decomposition based on the SVD agrees with the pseudoinverse algorithm provided by NumPy.
4.6. Mathematics of the SVD and implications#
Let us take a closer look at the mathematics of the SVD and the various implications for machine learning studies.
Our starting point is our design matrix
We can SVD decompose our matrix as
where
Similarly,
Finally
All values beyond
As an example, consider the following
The singular values are
where
contains only the singular values. Note also (and we will use this below) that
which is a
is a
The matrix that may cause problems for us is
and using the orthogonality of the matrix
We define
We can now insert the result for the matrix
and using our SVD decomposition of
which gives us, using the orthogonality of the matrices
It means that the ordinary least square model (with the optimal parameters)
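As a small numerical check (assumed toy data), we can verify that the OLS prediction can indeed be written in terms of the matrix U from the SVD acting on the data:

```python
import numpy as np

# Assumed toy data: the OLS prediction X @ betahat equals U @ U.T @ y,
# with U from the thin SVD of the design matrix X.
np.random.seed(10)
X = np.random.rand(10, 3)
y = np.random.rand(10)
betahat = np.linalg.pinv(X.T @ X) @ X.T @ y
U, s, VT = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X @ betahat, U @ U.T @ y))   # True
```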
4.7. Further properties (important for our analyses later)#
Let us study again
If we now multiply from the right with
This means the vectors
Similarly, if we use the SVD decomposition for the matrix
If we now multiply from the right with
This means the vectors
Important note: we have defined our design matrix
In our lectures, the number of columns will always refer to the number of features in our data set, while the number of rows represents the number of data inputs. Note that in other texts you may find the opposite notation. This has consequences for the definition of for example the covariance matrix and its relation to the SVD.
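A short numerical illustration (assumed random matrix) of these relations: the squared singular values of the design matrix coincide with the non-zero eigenvalues of both matrix products discussed above.

```python
import numpy as np

# Assumed random design matrix: the squared singular values of X are the
# non-zero eigenvalues of both X^T X and X X^T.
np.random.seed(1)
X = np.random.rand(6, 3)
U, s, VT = np.linalg.svd(X, full_matrices=False)
eig_XTX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
eig_XXT = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1][:3]   # keep the 3 non-zero ones
print(np.allclose(s**2, eig_XTX), np.allclose(s**2, eig_XXT))
```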
4.8. Meet the Covariance Matrix#
Before we move on to a discussion of Ridge and Lasso regression, we want to show an important example of the above.
We have already noted that the matrix
This quantity defines what is called the Hessian matrix (the second derivative of the cost function we want to optimize).
The Hessian matrix plays an important role and is defined in this course as
The Hessian matrix for ordinary least squares is also proportional to the covariance matrix. This means also that we can use the SVD to find the eigenvalues of the covariance matrix and the Hessian matrix in terms of the singular values. Let us develop these arguments, as they will play an important role in our machine learning studies.
Before we discuss the link between for example Ridge regression and the singular value decomposition, we need to remind ourselves about the definition of the covariance and the correlation function. These are quantities that play a central role in machine learning methods.
Suppose we have defined two vectors
where for example
With this definition and recalling that the variance is defined as
we can rewrite the covariance matrix as
Note: we have used
The covariance is unbounded (its absolute value can become arbitrarily large) and may thus lead to problems with loss of numerical precision for particularly large values. It is common to scale the covariance matrix by introducing instead the correlation matrix defined via the so-called correlation function
The correlation function is then given by values
In the above example this is the function we constructed using pandas.
In our derivation of the various regression algorithms like Ordinary Least Squares or Ridge regression
we defined the design/feature matrix
with
with a given vector
With these definitions, we can now rewrite our
and the correlation matrix
The NumPy function np.cov calculates the covariance elements using
the factor $1/(n-1)$ instead of $1/n$, since it applies Bessel's correction by default.
Note that this assumes you have the features as the rows, and the inputs as columns, that is
which in turn is converted into the
# Importing various packages
import numpy as np
n = 100
x = np.random.normal(size=n)
print(np.mean(x))
y = 4+3*x+np.random.normal(size=n)
print(np.mean(y))
W = np.vstack((x, y))
C = np.cov(W)
print(C)
0.00239782448350092
3.945643027765946
[[1.04582521 3.00214849]
[3.00214849 9.76163974]]
The previous example can be converted into the correlation matrix by
simply scaling the matrix elements with the variances. We should also
subtract the mean values for each column. This leads to the following
code which sets up the correlation matrix for the previous example in
a more brute-force way. Here we subtract the mean values for each column of the design matrix, calculate the relevant variances and covariances, and finally set up the correlation matrix.
import numpy as np
n = 100
# define two vectors
x = np.random.random(size=n)
y = 4+3*x+np.random.normal(size=n)
#scaling the x and y vectors
x = x - np.mean(x)
y = y - np.mean(y)
variance_x = np.sum(x@x)/n
variance_y = np.sum(y@y)/n
print(variance_x)
print(variance_y)
cov_xy = np.sum(x@y)/n
cov_xx = np.sum(x@x)/n
cov_yy = np.sum(y@y)/n
C = np.zeros((2,2))
C[0,0]= cov_xx/variance_x
C[1,1]= cov_yy/variance_y
C[0,1]= cov_xy/np.sqrt(variance_y*variance_x)
C[1,0]= C[0,1]
print(C)
0.07878080847813602
2.1500369042116563
[[1. 0.63295205]
[0.63295205 1. ]]
We see that the matrix elements along the diagonal are one as they should be and that the matrix is symmetric. Furthermore, diagonalizing this matrix we easily see that it is a positive definite matrix.
The above procedure with numpy can be made more compact if we use pandas.
We know here how we can set up the correlation matrix using pandas, as done in this simple code
import numpy as np
import pandas as pd
n = 10
x = np.random.normal(size=n)
x = x - np.mean(x)
y = 4+3*x+np.random.normal(size=n)
y = y - np.mean(y)
# Note that we transpose the matrix in order to stay with our ordering n x p
X = (np.vstack((x, y))).T
print(X)
Xpd = pd.DataFrame(X)
print(Xpd)
correlation_matrix = Xpd.corr()
print(correlation_matrix)
[[-0.05862476 0.45533295]
[ 0.84689611 4.01381445]
[ 0.21778772 0.62032544]
[-1.3117471 -2.69182187]
[-0.78512935 -3.44198948]
[-0.12030798 -0.93994208]
[ 0.60559053 2.2457079 ]
[ 1.8170149 4.51652722]
[-0.44215901 -3.25646952]
[-0.76932106 -1.52148501]]
0 1
0 -0.058625 0.455333
1 0.846896 4.013814
2 0.217788 0.620325
3 -1.311747 -2.691822
4 -0.785129 -3.441989
5 -0.120308 -0.939942
6 0.605591 2.245708
7 1.817015 4.516527
8 -0.442159 -3.256470
9 -0.769321 -1.521485
0 1
0 1.00000 0.92403
1 0.92403 1.00000
We expand this model to the Franke function discussed earlier.
# Common imports
import numpy as np
import pandas as pd
def FrankeFunction(x,y):
    term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
    term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
    term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
    term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
    return term1 + term2 + term3 + term4

def create_X(x, y, n ):
    if len(x.shape) > 1:
        x = np.ravel(x)
        y = np.ravel(y)

    N = len(x)
    l = int((n+1)*(n+2)/2)          # Number of elements in beta
    X = np.ones((N,l))

    for i in range(1,n+1):
        q = int((i)*(i+1)/2)
        for k in range(i+1):
            X[:,q+k] = (x**(i-k))*(y**k)

    return X
# Making meshgrid of datapoints and compute Franke's function
n = 4
N = 100
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
Xpd = pd.DataFrame(X)
# subtract the mean values and set up the covariance matrix
Xpd = Xpd - Xpd.mean()
covariance_matrix = Xpd.cov()
print(covariance_matrix)
0 1 2 3 4 5 6 7 \
0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.0 0.081622 0.081406 0.078296 0.079396 0.080527 0.068235 0.069327
2 0.0 0.081406 0.081919 0.078612 0.080089 0.081603 0.068822 0.070146
3 0.0 0.078296 0.078612 0.080411 0.081717 0.083038 0.073368 0.074578
4 0.0 0.079396 0.080089 0.081717 0.083265 0.084831 0.074620 0.075994
5 0.0 0.080527 0.081603 0.083038 0.084831 0.086645 0.075870 0.077411
6 0.0 0.068235 0.068822 0.073368 0.074620 0.075870 0.069194 0.070300
7 0.0 0.069327 0.070146 0.074578 0.075994 0.077411 0.070300 0.071523
8 0.0 0.070483 0.071536 0.075839 0.077423 0.079011 0.071441 0.072784
9 0.0 0.071702 0.072994 0.077153 0.078908 0.080672 0.072619 0.074087
10 0.0 0.059187 0.059821 0.065770 0.066863 0.067945 0.063580 0.064519
11 0.0 0.060114 0.060903 0.066767 0.067977 0.069178 0.064472 0.065498
12 0.0 0.061096 0.062042 0.067812 0.069142 0.070466 0.065402 0.066517
13 0.0 0.062134 0.063240 0.068908 0.070362 0.071812 0.066371 0.067578
14 0.0 0.063230 0.064502 0.070057 0.071639 0.073220 0.067381 0.068684
8 9 10 11 12 13 14
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.070483 0.071702 0.059187 0.060114 0.061096 0.062134 0.063230
2 0.071536 0.072994 0.059821 0.060903 0.062042 0.063240 0.064502
3 0.075839 0.077153 0.065770 0.066767 0.067812 0.068908 0.070057
4 0.077423 0.078908 0.066863 0.067977 0.069142 0.070362 0.071639
5 0.079011 0.080672 0.067945 0.069178 0.070466 0.071812 0.073220
6 0.071441 0.072619 0.063580 0.064472 0.065402 0.066371 0.067381
7 0.072784 0.074087 0.064519 0.065498 0.066517 0.067578 0.068684
8 0.074170 0.075601 0.065481 0.066550 0.067660 0.068817 0.070021
9 0.075601 0.077163 0.066469 0.067629 0.068835 0.070089 0.071396
10 0.065481 0.066469 0.059543 0.060291 0.061066 0.061870 0.062705
11 0.066550 0.067629 0.060291 0.061104 0.061946 0.062820 0.063728
12 0.067660 0.068835 0.061066 0.061946 0.062859 0.063805 0.064788
13 0.068817 0.070089 0.061870 0.062820 0.063805 0.064827 0.065887
14 0.070021 0.071396 0.062705 0.063728 0.064788 0.065887 0.067030
We note here that the covariance is zero for the first row and
column since the first column of the design matrix contains only ones
(we are fitting the function in terms of a polynomial of degree
This means that the variance for these elements will be zero and will cause problems when we set up the correlation matrix. We can simply drop these elements and construct a correlation matrix without these elements.
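A minimal sketch of this (with an assumed small design matrix whose first column is constant):

```python
import numpy as np
import pandas as pd

# Assumed small design matrix whose first column is constant (zero variance):
# we drop that column before computing the correlation matrix.
np.random.seed(2)
X = np.column_stack((np.ones(50), np.random.rand(50), np.random.rand(50)))
Xpd = pd.DataFrame(X)
Xpd = Xpd - Xpd.mean()
correlation_matrix = Xpd.drop(columns=[0]).corr()
print(correlation_matrix)
```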
We can rewrite the covariance matrix in a more compact form in terms of the design/feature matrix
To see this let us simply look at a design matrix
If we then compute the expectation value (note the
which is just
where we wrote
It is easy to generalize this to a matrix
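As a quick numerical check (assumed random data), we can compare this compact expression with NumPy's own covariance function, remembering to center the columns and to use the $1/n$ normalization:

```python
import numpy as np

# Assumed random data: for a column-centered design matrix X with n rows,
# the covariance matrix is X.T @ X / n (no Bessel correction).
np.random.seed(3)
n = 1000
X = np.random.rand(n, 3)
X = X - X.mean(axis=0)
C_manual = X.T @ X / n
C_numpy = np.cov(X.T, bias=True)       # bias=True selects the 1/n normalization
print(np.allclose(C_manual, C_numpy))  # True
```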
4.9. Linking with the SVD#
We saw earlier that
Since the matrices here have dimension
where the tilde-matrix
meaning we can write
Multiplying from the right with
This means the vectors
In other words, each non-zero singular value of
Note that these are also the eigenvectors and eigenvalues of the Hessian matrix.
If we now recall the definition of the covariance matrix (not using Bessel’s correction) we have
meaning that every squared non-singular value of
For
Since the matrices here have dimension
leading to
Multiplying with
It means that the eigenvalues of
Since we will mainly be interested in the correlations among the features
of our data (the columns of
4.10. Ridge and Lasso Regression#
Let us remind ourselves about the expression for the standard Mean Squared Error (MSE) which we used to define our cost function and the equations for the ordinary least squares (OLS) method, that is our optimization problem is
or we can state it as
where we have used the definition of a norm-2 vector, that is
By minimizing the above equation with respect to the parameters
which leads to the Ridge regression minimization problem where we
require that
we have a new optimization equation
which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator.
Here we have defined the norm-1 as
Using the matrix-vector expression for Ridge regression and dropping the parameter
and
taking the derivatives with respect to
with
with
When we compare this with the ordinary least squares result we have
which can lead to singular matrices. However, with the SVD, we can always compute the inverse of the matrix
We see that Ridge regression is nothing but the standard OLS with a
modified diagonal term added to
Using our insights about the SVD of the design matrix
For Ridge regression this becomes
with the vectors
Since
Ridge regression finds the coordinates of
For small eigenvalues
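The following sketch (assumed toy data and an arbitrary value of the regularization parameter) computes these shrinkage factors explicitly and verifies that the SVD form of the Ridge prediction agrees with the matrix-inverse form:

```python
import numpy as np

# Assumed toy data and an arbitrary regularization parameter: Ridge shrinks
# each SVD direction by sigma_j^2 / (sigma_j^2 + lambda).
np.random.seed(4)
X = np.random.rand(20, 4)
y = np.random.rand(20)
lam = 0.1
U, s, VT = np.linalg.svd(X, full_matrices=False)
shrinkage = s**2 / (s**2 + lam)
print(shrinkage)       # close to 1 for large singular values, small otherwise
# The SVD form of the Ridge prediction agrees with the matrix-inverse form
ypred_svd = U @ np.diag(shrinkage) @ U.T @ y
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(4)) @ X.T @ y
print(np.allclose(ypred_svd, X @ beta_ridge))   # True
```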
For the sake of simplicity, let us assume that the design matrix is orthonormal, that is
In this case the standard OLS results in
and
that is the Ridge estimator scales the OLS estimator by the inverse of a factor
We will come back to more interpretations after we have gone through the statistical analysis part.
Using the matrix-vector expression for Lasso regression and dropping the parameter
Taking the derivative with respect to
we have that the derivative of the cost function is
and reordering we have
This equation does not lead to a nice analytical equation as in Ridge regression or ordinary least squares. This equation can however be solved by using standard convex optimization algorithms using for example the Python package CVXOPT. We will discuss this later.
Let us assume that our design matrix is given by the unit (identity) matrix, that is, a square diagonal matrix with ones only along the
diagonal. In this case we have an equal number of rows and columns
Our model approximation is just
and minimizing we have that
For Ridge regression our cost function is
and minimizing we have that
For Lasso regression our cost function is
and minimizing we have that
which leads to
Plotting these results (figure in handwritten notes for week 36) shows clearly that Lasso regression suppresses (sets to zero) values of
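A minimal sketch of these closed-form results for an identity design matrix (the exact Lasso threshold depends on the normalization chosen for the cost function, so the factor below is only indicative):

```python
import numpy as np

# Assumed identity design matrix: closed-form estimators for OLS, Ridge and
# Lasso. The Lasso threshold lambda/2 corresponds to the cost function
# ||y - beta||^2 + lambda*||beta||_1; other normalizations shift the threshold.
y = np.array([3.0, 0.4, -1.2, 0.05])
lam = 1.0
beta_ols = y.copy()
beta_ridge = y / (1.0 + lam)
beta_lasso = np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)
print(beta_ols)
print(beta_ridge)
print(beta_lasso)    # small components are set exactly to zero
```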
As another example, let us assume we have a data set with outputs/targets given by the vector
and our inputs as a
meaning that we have two features and two unknown parameters
For ordinary least squares (OLS) we know that the optimal solution is
Inserting the above values we obtain that
The code which implements this simpler case is presented after the discussion of Ridge and Lasso.
For Ridge regression we have
Inserting the above values we obtain that
There is normally a constraint on the value of
To see this, let us write the cost function for Ridge regression.
We define the MSE without the
and taking the derivative with respect to
and for
Using the constraint for
which gives
For Lasso we need now, keeping a constraint on
and
We have now four cases to solve besides the trivial case where both parameters are zero, namely the four possible sign combinations of the two parameters.
If we consider the first case, we have then
and
which yields
and
Using the constraint on
Here we set up the OLS, Ridge and Lasso functionality in order to study the above example. Note that here we have opted for a set of values of
First we study and compare the OLS and Ridge results. The next code compares all three methods.
We select values of the hyperparameter
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
X = np.array( [ [ 2, 0], [0, 1], [0,0]])
y = np.array( [4, 2, 3])
# matrix inversion to find beta
OLSbeta = np.linalg.inv(X.T @ X) @ X.T @ y
print(OLSbeta)
# and then make the prediction
ytildeOLS = X @ OLSbeta
print("Training MSE for OLS")
print(MSE(y,ytildeOLS))
ypredictOLS = X @ OLSbeta
# Repeat now for Ridge regression and various values of the regularization parameter
I = np.eye(2,2)
# Decide which values of lambda to use
nlambdas = 100
MSEPredict = np.zeros(nlambdas)
lambdas = np.logspace(-4, 4, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    Ridgebeta = np.linalg.inv(X.T @ X+lmb*I) @ X.T @ y
#    print(Ridgebeta)
    # and then make the prediction
    ypredictRidge = X @ Ridgebeta
    MSEPredict[i] = MSE(y,ypredictRidge)
#    print(MSEPredict[i])
# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSEPredict, 'r--', label = 'MSE Ridge Train')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()
[2. 2.]
Training MSE for OLS
3.0

We see here that we reach a plateau for the Ridge results. Writing out the coefficients
This happens also for Lasso regression, as seen from the next code
output. The difference is that Lasso shrinks the values of
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
X = np.array( [ [ 2, 0], [0, 1], [0,0]])
y = np.array( [4, 2, 3])
# matrix inversion to find beta
OLSbeta = np.linalg.inv(X.T @ X) @ X.T @ y
print(OLSbeta)
# and then make the prediction
ytildeOLS = X @ OLSbeta
print("Training MSE for OLS")
print(MSE(y,ytildeOLS))
ypredictOLS = X @ OLSbeta
# Repeat now for Ridge regression and various values of the regularization parameter
I = np.eye(2,2)
# Decide which values of lambda to use
nlambdas = 100
MSERidgePredict = np.zeros(nlambdas)
MSELassoPredict = np.zeros(nlambdas)
lambdas = np.logspace(-4, 4, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    Ridgebeta = np.linalg.inv(X.T @ X+lmb*I) @ X.T @ y
    print(Ridgebeta)
    # and then make the prediction
    ypredictRidge = X @ Ridgebeta
    MSERidgePredict[i] = MSE(y,ypredictRidge)
    RegLasso = linear_model.Lasso(lmb)
    RegLasso.fit(X,y)
    ypredictLasso = RegLasso.predict(X)
    print(RegLasso.coef_)
    MSELassoPredict[i] = MSE(y,ypredictLasso)
# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSERidgePredict, 'r--', label = 'MSE Ridge Train')
plt.plot(np.log10(lambdas), MSELassoPredict, 'r--', label = 'MSE Lasso Train')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()
[2. 2.]
Training MSE for OLS
3.0
[1.99995 1.99980002]
[ 0.50001525 -0.99953475]
[1.99993978 1.99975913]
[ 0.50001525 -0.99944272]
[1.99992746 1.99970988]
[ 0.50001525 -0.99933188]
[1.99991263 1.99965056]
[ 0.50001525 -0.99919837]
[1.99989476 1.99957911]
[ 0.50001524 -0.99903755]
[1.99987324 1.99949306]
[ 0.50001524 -0.99884384]
[1.99984732 1.99938942]
[ 0.50001524 -0.99861053]
[1.9998161 1.99926459]
[ 0.50001523 -0.9983295 ]
[1.99977849 1.99911427]
[ 0.50001523 -0.99799099]
[1.9997332 1.99893323]
[ 0.50001522 -0.99758326]
[1.99967865 1.99871521]
[ 0.50001521 -0.99709215]
[1.99961294 1.99845267]
[ 0.50001521 -0.99650061]
[1.99953381 1.99813653]
[ 0.50001519 -0.99578809]
[1.9994385 1.99775587]
[ 0.50001518 -0.99492986]
[1.9993237 1.99729756]
[ 0.50001517 -0.99389612]
[1.99918546 1.9967458 ]
[ 0.50001515 -0.99265097]
[1.99901896 1.99608161]
[ 0.50001512 -0.99115119]
[1.99881845 1.99528218]
[ 0.5000151 -0.9893447]
[1.998577 1.9943201]
[ 0.50001506 -0.98716878]
[1.99828624 1.99316252]
[ 0.50001502 -0.98454786]
[1.99793613 1.99176998]
[ 0.50001498 -0.98139097]
[1.99751458 1.99009525]
[ 0.50001492 -0.97758848]
[1.99700706 1.98808176]
[ 0.50001485 -0.97300836]
[1.9963961 1.98566191]
[ 0.50001476 -0.9674916 ]
[1.99566069 1.98275501]
[ 0.50001466 -0.96084663]
[1.9947756 1.97926491]
[ 0.50001454 -0.95284275]
[1.99371056 1.97507735]
[ 0.50001439 -0.94320205]
[1.99242921 1.97005689]
[ 0.50001422 -0.93158979]
[1.99088801 1.9640435 ]
[ 0.500014 -0.91760278]
[1.9890348 1.95684892]
[ 0.50001374 -0.90075537]
[1.98680716 1.9482527 ]
[ 0.50001343 -0.88046261]
[1.98413059 1.93799826]
[ 0.50001306 -0.85601992]
[1.98091621 1.92578916]
[ 0.50001261 -0.8265786 ]
[1.97705827 1.91128596]
[ 0.50001207 -0.79111643]
[1.97243128 1.89410423]
[ 0.50001142 -0.74840212]
[1.96688672 1.87381451]
[ 0.50001063 -0.69695259]
[1.96024953 1.84994524]
[ 0.50000969 -0.63498144]
[1.95231424 1.82198978]
[ 0.50000855 -0.56033697]
[1.94284104 1.78941903]
[ 0.50000718 -0.47042744]
[1.93155188 1.75170092]
[ 0.50000553 -0.3621311 ]
[1.91812702 1.70832814]
[ 0.50001414 -0.23167717]
[1.90220243 1.65885453]
[ 0.50000455 -0.07456491]
[1.88336879 1.60293962]
[ 0.47132891 -0. ]
[1.86117291 1.54039921]
[ 0.41433969 -0. ]
[1.83512277 1.47125748]
[ 0.34569596 -0. ]
[1.80469739 1.39579407]
[ 0.26301436 -0. ]
[1.76936315 1.31457796]
[ 0.16342407 -0. ]
[1.72859758 1.22847924]
[ 0.04346721 -0. ]
[1.68192193 1.13865173]
[ 0. -0.]
[1.62894215 1.04648335]
[ 0. -0.]
[1.56939714 0.95351665]
[ 0. -0.]
[1.50321091 0.86134827]
[ 0. -0.]
[1.43054282 0.77152076]
[ 0. -0.]
[1.35182854 0.68542204]
[ 0. -0.]
[1.26780278 0.60420593]
[ 0. -0.]
[1.17949575 0.52874252]
[ 0. -0.]
[1.0881981 0.45960079]
[ 0. -0.]
[0.99539415 0.39706038]
[ 0. -0.]
[0.90266948 0.34114547]
[ 0. -0.]
[0.81160425 0.29167186]
[ 0. -0.]
[0.7236674 0.24829908]
[ 0. -0.]
[0.64012627 0.21058097]
[ 0. -0.]
[0.56198284 0.17801022]
[ 0. -0.]
[0.48994188 0.15005476]
[ 0. -0.]
[0.42441033 0.12618549]
[ 0. -0.]
[0.3655222 0.10589577]
[ 0. -0.]
[0.31318084 0.08871404]
[ 0. -0.]
[0.26710969 0.07421084]
[ 0. -0.]
[0.22690428 0.06200174]
[ 0. -0.]
[0.19207979 0.0517473 ]
[ 0. -0.]
[0.16211139 0.04315108]
[ 0. -0.]
[0.13646574 0.0359565 ]
[ 0. -0.]
[0.11462415 0.02994311]
[ 0. -0.]
[0.09609807 0.02492265]
[ 0. -0.]
[0.08043851 0.02073509]
[ 0. -0.]
[0.06724062 0.01724499]
[ 0. -0.]
[0.05614483 0.01433809]
[ 0. -0.]
[0.04683565 0.01191824]
[ 0. -0.]
[0.039039 0.00990475]
[ 0. -0.]
[0.03251863 0.00823002]
[ 0. -0.]
[0.02707227 0.00683748]
[ 0. -0.]
[0.02252765 0.0056799 ]
[ 0. -0.]
[0.01873869 0.00471782]
[ 0. -0.]
[0.01558197 0.00391839]
[ 0. -0.]
[0.01295356 0.0032542 ]
[ 0. -0.]
[0.01076611 0.00270244]
[ 0. -0.]
[0.00894639 0.00224413]
[ 0. -0.]
[0.0074331 0.00186347]
[ 0. -0.]
[0.00617499 0.00154733]
[ 0. -0.]
[0.00512927 0.00128479]
[ 0. -0.]
[0.00426027 0.00106677]
[ 0. -0.]
[0.00353823 0.00088573]
[ 0. -0.]
[0.00293838 0.00073541]
[ 0. -0.]
[0.0024401 0.00061058]
[ 0. -0.]
[0.00202624 0.00050694]
[ 0. -0.]
[0.00168251 0.00042089]
[ 0. -0.]
[0.00139705 0.00034944]
[ 0. -0.]
[0.00115999 0.00029012]
[ 0. -0.]
[0.00096314 0.00024087]
[ 0. -0.]
[0.00079968 0.00019998]
[ 0. -0.]

We then bring back our exponential function example and study all
three regression methods. Depending on the level of noise, we note
that for small values of the hyperparameter
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(3155)
x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)
# number of features p (here degree of polynomial
p = 3
# The design matrix now as function of a given polynomial
X = np.zeros((len(x),p))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x*x
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# matrix inversion to find beta
OLSbeta = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
print(OLSbeta)
# and then make the prediction
ytildeOLS = X_train @ OLSbeta
print("Training MSE for OLS")
print(MSE(y_train,ytildeOLS))
ypredictOLS = X_test @ OLSbeta
print("Test MSE OLS")
print(MSE(y_test,ypredictOLS))
# Repeat now for Lasso and Ridge regression and various values of the regularization parameter
I = np.eye(p,p)
# Decide which values of lambda to use
nlambdas = 100
MSEPredict = np.zeros(nlambdas)
MSETrain = np.zeros(nlambdas)
MSELassoPredict = np.zeros(nlambdas)
MSELassoTrain = np.zeros(nlambdas)
lambdas = np.logspace(-4, 4, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    Ridgebeta = np.linalg.inv(X_train.T @ X_train+lmb*I) @ X_train.T @ y_train
    # include lasso using Scikit-Learn
    RegLasso = linear_model.Lasso(lmb)
    RegLasso.fit(X_train,y_train)
    # and then make the prediction
    ytildeRidge = X_train @ Ridgebeta
    ypredictRidge = X_test @ Ridgebeta
    ytildeLasso = RegLasso.predict(X_train)
    ypredictLasso = RegLasso.predict(X_test)
    MSEPredict[i] = MSE(y_test,ypredictRidge)
    MSETrain[i] = MSE(y_train,ytildeRidge)
    MSELassoPredict[i] = MSE(y_test,ypredictLasso)
    MSELassoTrain[i] = MSE(y_train,ytildeLasso)
# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSETrain, label = 'MSE Ridge train')
plt.plot(np.log10(lambdas), MSEPredict, 'r--', label = 'MSE Ridge Test')
plt.plot(np.log10(lambdas), MSELassoTrain, label = 'MSE Lasso train')
plt.plot(np.log10(lambdas), MSELassoPredict, 'r--', label = 'MSE Lasso Test')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()
[ 2.03099776 -0.17917768 5.18029127]
Training MSE for OLS
0.009163470508352218
Test MSE OLS
0.008675369724975977

Both these examples send a clear message. The addition of a
shrinkage/regularization term implies that we need to perform a search
for the optimal values of
As a small addendum, we note that you can also solve this problem using the convex optimization package CVXOPT. This requires, in addition to having installed CVXOPT, that you download the file l1regl.py.
4.11. Linking the regression analysis with a statistical interpretation#
We will now couple the discussions of ordinary least squares, Ridge
and Lasso regression with a statistical interpretation, that is we
move from a linear algebra analysis to a statistical analysis. In
particular, we will focus on what the regularization terms can result
in. We will amongst other things show that the regularization
parameter can reduce considerably the variance of the parameters
The
advantage of doing linear regression is that we actually end up with
analytical expressions for several statistical quantities.
Standard least squares and Ridge regression allow us to
derive quantities like the variance and other expectation values in a
rather straightforward way.
It is assumed that
The randomness of
Recall that
The assumption we have made here can be summarized as (and this is going to be useful when we discuss the bias-variance trade off)
that there exists a function
We approximate this function with our model from the solution of the linear regression equations, that is our
function
We can calculate the expectation value of
while its variance is
Hence,
With the OLS expressions for the parameters
This means that the estimator of the regression parameters is unbiased.
We can also calculate the variance
The variance of
where we have used that
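As a sanity check (assumed toy model and noise level), a small Monte Carlo experiment confirms that the OLS estimator is unbiased and that its covariance approaches the analytical expression above:

```python
import numpy as np

# Assumed toy model and noise level: Monte Carlo check that the OLS estimator
# is unbiased and that its covariance approaches sigma^2 * (X^T X)^{-1}.
np.random.seed(5)
n, sigma = 200, 0.5
X = np.random.rand(n, 3)
beta_true = np.array([1.0, -2.0, 0.5])
XtXinv = np.linalg.inv(X.T @ X)
betas = np.zeros((5000, 3))
for i in range(5000):
    y = X @ beta_true + sigma * np.random.randn(n)
    betas[i] = XtXinv @ X.T @ y
print(betas.mean(axis=0))          # close to beta_true
print(np.cov(betas.T, bias=True))  # close to the analytical covariance below
print(sigma**2 * XtXinv)
```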
In a similar way, we can obtain analytical expressions for say the
expectation values of the parameters
It is rather straightforward to show that
We see clearly that
We can also compute the variance as
and it is easy to see that if the parameter
With this, we can compute the difference
The difference is non-negative definite since each component of the
matrix product is non-negative definite.
This means the variance we obtain with the standard OLS will always for
4.12. Deriving OLS from a probability distribution#
Our basic assumption when we derived the OLS equations was to assume
that our output is determined by a given continuous function
We found above that the outputs
We assume now that the various
which reads as finding the likelihood of an event
Since these events are assumed to be independent and identically distributed we can build the probability distribution function (PDF) for all possible events
We will write this in a more compact form reserving
In the more general case the various inputs should be replaced by the possible features represented by the input data set
It is a conditional probability (see below) and reads as the
likelihood of a domain of events
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is the most probable.
We will assume here that our events are given by the above Gaussian
distribution and we will determine the optimal parameters
In practice, it is more convenient to maximize the logarithm of the PDF because it is a monotonically increasing function of the argument. Alternatively, and this will be our option, we will minimize the negative of the logarithm since this is a monotonically decreasing function.
Note also that maximization/minimization of the logarithm of the PDF is equivalent to the maximization/minimization of the function itself.
We could now define a new cost function to minimize, namely the negative logarithm of the above PDF
which becomes
Taking the derivative of the new cost function with respect to the parameters
which leads to the well-known OLS equation for the optimal parameters
Before we make a similar analysis for Ridge and Lasso regression, we need a short reminder on statistics.
A central theorem in statistics is Bayes’ theorem. This theorem plays a similar role as the good old Pythagoras’ theorem in geometry. Bayes’ theorem is extremely simple to derive. But to do so we need some basic axioms from statistics.
Assume we have two domains of events
We define also the likelihood for
The union of events is given by
The product rule (aka joint probability) is given by
where we read
If we have independent events then
The marginal probability is defined in terms of only one of the set of variables
The conditional probability, if
If we combine the conditional probability with the marginal probability and the standard product rule, we have
which we can rewrite as
which is Bayes' theorem. It allows us to evaluate the uncertainty in
The quantity
The function
Let us try to illustrate Bayes’ theorem through an example.
Let us suppose that you are undergoing a series of mammography scans
in order to rule out possible breast cancer cases. We define the
sensitivity for a positive event by the variable
We let
Let us assume that if you have breast cancer, the test will be positive with a probability of
This obviously sounds scary since many would conclude that if the test
is positive, there is a likelihood of
instead of
If we look at various national surveys on breast cancer, the general likelihood of developing breast cancer is a very small number. Let us assume that the prior probability in the population as a whole is
We need also to account for the fact that the test may produce a false positive result (false alarm). Let us here assume that we have
Using Bayes’ theorem we can then find the posterior probability that the person has breast cancer in case of a positive test, that is we can compute
That is, in case of a positive test, there is only a
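A minimal sketch of this calculation with purely hypothetical numbers (the actual values used in the example above are not reproduced here):

```python
# Hypothetical numbers only, for illustration of Bayes' theorem:
p_cancer = 0.004           # assumed prior probability in the population
p_pos_given_cancer = 0.8   # assumed sensitivity of the test
p_pos_given_healthy = 0.1  # assumed false-positive rate
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1.0 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_cancer_given_pos)  # only a few per cent, despite the positive test
```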
4.13. Bayes’ Theorem and Ridge and Lasso Regression#
Hitherto we have discussed Ridge and Lasso regression in terms of a linear analysis. This may to many of you feel rather technical and perhaps not that intuitive. The question is whether we can develop a more intuitive way of understanding what Ridge and Lasso express.
Before we proceed let us perform a Ridge, Lasso and OLS analysis of a polynomial fit.
We will play around with a study of the values for the optimal
parameters
For Ridge and Lasso regression, the higher-order parameters will typically be reduced, thereby providing smaller fluctuations from one order to the next.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
def R2(y_data, y_model):
    return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)

def MSE(y_data,y_model):
    n = np.size(y_model)
    return np.sum((y_data-y_model)**2)/n
# Make data set.
n = 10000
x = np.random.rand(n)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.randn(n)
Maxpolydegree = 5
X = np.zeros((len(x),Maxpolydegree))
X[:,0] = 1.0
for polydegree in range(1, Maxpolydegree):
    for degree in range(polydegree):
        X[:,degree] = x**(degree)
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# matrix inversion to find beta
OLSbeta = np.linalg.pinv(X_train.T @ X_train) @ X_train.T @ y_train
print(OLSbeta)
ypredictOLS = X_test @ OLSbeta
print("Test MSE OLS")
print(MSE(y_test,ypredictOLS))
# Repeat now for Lasso and Ridge regression and various values of the regularization parameter using Scikit-Learn
# Decide which values of lambda to use
nlambdas = 4
MSERidgePredict = np.zeros(nlambdas)
MSELassoPredict = np.zeros(nlambdas)
lambdas = np.logspace(-3, 1, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    # Make the fit using Ridge and Lasso
    RegRidge = linear_model.Ridge(lmb,fit_intercept=False)
    RegRidge.fit(X_train,y_train)
    RegLasso = linear_model.Lasso(lmb,fit_intercept=False)
    RegLasso.fit(X_train,y_train)
    # and then make the prediction
    ypredictRidge = RegRidge.predict(X_test)
    ypredictLasso = RegLasso.predict(X_test)
    # Compute the MSE and print it
    MSERidgePredict[i] = MSE(y_test,ypredictRidge)
    MSELassoPredict[i] = MSE(y_test,ypredictLasso)
#    print(lmb,RegRidge.coef_)
#    print(lmb,RegLasso.coef_)
# Now plot the results
plt.figure()
plt.plot(np.log10(lambdas), MSERidgePredict, 'b', label = 'MSE Ridge Test')
plt.plot(np.log10(lambdas), MSELassoPredict, 'r', label = 'MSE Lasso Test')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()
[ 1.0169643 0.27924636 -1.4087793 1.03308408 0. ]
Test MSE OLS
0.958228616652075

How can we understand this?
Let us write out the values of the coefficients
If we don’t include any noise and run this code for different values
of the polynomial degree, we notice that the results for
If we however add noise, what happens is that the polynomial fit is
trying to adjust the fit to traverse in the best possible way all data
points. This can lead to large fluctuations in the parameters
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
# Make data set.
n = 1000
x = np.random.rand(n)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.randn(n)
Maxpolydegree = 5
X = np.zeros((len(x),Maxpolydegree))
X[:,0] = 1.0
for polydegree in range(1, Maxpolydegree):
    for degree in range(polydegree):
        X[:,degree] = x**(degree)
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Decide which values of lambda to use
nlambdas = 5
lambdas = np.logspace(-3, 2, nlambdas)
for i in range(nlambdas):
    lmb = lambdas[i]
    # Make the fit using Ridge only
    RegRidge = linear_model.Ridge(lmb,fit_intercept=False)
    RegRidge.fit(X_train,y_train)
    # and then make the prediction
    ypredictRidge = RegRidge.predict(X_test)
    Coeffs = np.array(RegRidge.coef_)
    BetaValues = pd.DataFrame(Coeffs)
    BetaValues.columns = ['beta']
    display(BetaValues)
       beta
0  0.986699
1 -0.606760
2  1.280573
3 -0.850164
4  0.000000
       beta
0  0.978553
1 -0.511888
2  1.051418
3 -0.701370
4  0.000000
       beta
0  0.946957
1 -0.162246
2  0.221921
3 -0.167787
4  0.000000
       beta
0  0.906747
1  0.017665
2 -0.029483
3 -0.053849
4  0.000000
       beta
0  0.718165
1  0.156956
2  0.040102
3 -0.001880
4  0.000000
As an exercise, repeat these calculations with ordinary least squares
only with and without noise. Calculate thereafter the variance of the
parameters
4.14. Linking Bayes’ Theorem with Ridge and Lasso Regression#
We have seen that Ridge regression suppresses those features which
have a small singular value. This corresponds to a feature which exhibits
a large variance in the parameters
For ordinary least squares we postulated that the maximum likelihood for the domain of events
is given by
In Bayes’ theorem this function plays the role of the so-called likelihood. We could now ask the question what is the posterior probability of a parameter set
Bayes’ theorem comes to our rescue here since (omitting the normalization constant)
We have a model for
With the posterior probability defined by a likelihood which we have already modeled and an unknown prior, we are now ready to make additional models for the prior.
We can, based on our discussions of the variance of
Our posterior probability becomes then (omitting the normalization factor which is just a constant)
We can now optimize this quantity with respect to
and replacing
which is our Ridge cost function! Nice, isn’t it?
To derive the Lasso cost function, we simply replace the Gaussian prior with an exponential distribution (Laplace in this case) with zero mean value, that is
Our posterior probability becomes then (omitting the normalization factor which is just a constant)
Taking the negative
logarithm of the posterior probability and leaving out the
constant terms that do not depend on
and replacing
which is our Lasso cost function!
Plotting these prior functions shows us that we can use the parameter
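A minimal sketch of such a plot, with zero mean and an arbitrary unit scale for both priors:

```python
import numpy as np
import matplotlib.pyplot as plt

# Zero-mean priors with an arbitrary unit scale, for illustration only.
beta = np.linspace(-4, 4, 401)
gaussian = np.exp(-0.5 * beta**2) / np.sqrt(2 * np.pi)   # Ridge (Gaussian prior)
laplace = 0.5 * np.exp(-np.abs(beta))                    # Lasso (Laplace prior)
plt.plot(beta, gaussian, label='Gaussian prior (Ridge)')
plt.plot(beta, laplace, label='Laplace prior (Lasso)')
plt.xlabel('parameter value')
plt.ylabel('prior density')
plt.legend()
plt.show()
```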