Week 35: From Ordinary Linear Regression to Ridge and Lasso Regression#
Morten Hjorth-Jensen, Department of Physics, University of Oslo
Date: August 26-30, 2024
Plans for week 35#
The main topics are:
Brief repetition from last week
Discussions of the equations for ordinary least squares
Discussion on how to prepare data and examples of applications of linear regression
Material for the lecture on Monday: Mathematical interpretations of linear regression
Monday: Ridge and Lasso regression and Singular Value Decomposition
Reading recommendations:#
These lecture notes
Goodfellow, Bengio and Courville, Deep Learning, chapter 2 on linear algebra and sections 3.1-3.10 on elements of statistics (background)
Raschka et al on preprocessing of data, relevant for exercise 3 this week, see chapter 4.
For exercise 1 of week 35, the book by A. Aldo Faisal, Cheng Soon Ong, and Marc Peter Deisenroth on the Mathematics of Machine Learning may be very relevant. In particular, chapter 5 at https://mml-book.github.io/ (section 5.5 on derivatives) is very useful for exercise 1 this coming week.
For exercise sessions: Why Linear Regression (aka Ordinary Least Squares and family), repeat from last week#
We need first a reminder from last week about linear regression.
Fitting a continuous function with linear parameterization in terms of the parameters
Method of choice for fitting a continuous function!
Gives an excellent introduction to central Machine Learning features with understandable pedagogical links to other methods like Neural Networks, Support Vector Machines etc
Analytical expression for the fitting parameters
Analytical expressions for statistical properties like mean values, variances, confidence intervals and more
Analytical relation with probabilistic interpretations
Easy to introduce basic concepts like bias-variance tradeoff, cross-validation, resampling and regularization techniques and many other ML topics
Easy to code! And links well with classification problems and logistic regression and neural networks
Allows for easy hands-on understanding of gradient descent methods
and many more features
For more discussions of Ridge and Lasso regression, Wessel van Wieringen’s article is highly recommended. Similarly, Mehta et al’s article is also recommended.
The equations for ordinary least squares#
Our data, to which we want to apply a machine learning method, consist
of a set of inputs
or in general
where
In linear regression we approximate the unknown function with another
continuous function
Last week we introduced the so-called design matrix in order to define
the approximation
and in order to find the optimal parameters
The cost/loss function#
We used the mean squared error to define the way we measure the quality of our model
or using the matrix
This function represents one of many possible ways to define the so-called cost function.
It is also common to define
the function
since when taking the first derivative with respect to the unknown parameters
Interpretations and optimizing our parameters#
The function
can be linked to the variance of the quantity
where
In order to find the parameters
In practical terms it means we will require
which results in
or in a matrix-vector form as (multiplying away the factor
Interpretations and optimizing our parameters#
We can rewrite, see the derivations below,
as
and if the matrix
We note also that since our design matrix is defined as
Small question: Do you think the example we have at hand here (the nuclear binding energies) can lead to problems in inverting the matrix
Some useful matrix and vector expressions#
The following matrix and vector relation will be useful here and for the rest of the course. Vectors are always written as boldfaced lower case letters and matrices as upper case boldfaced letters. In the following we will discuss how to calculate derivatives of various matrices relevant for machine learning. We will often represent our data in terms of matrices and vectors.
Let us introduce first some conventions. We assume that
We assume also that
The Jacobian#
We define the partial derivatives of the various components of
which is an
When this matrix is a square matrix
Derivatives, example 1#
Let now
with
From this we have, using the definition of the Jacobian
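As a small numerical check of this example (a sketch; the matrix and vector below are arbitrary illustrations), we can compare the finite-difference Jacobian of the map from x to Ax with the matrix A itself:

# Sketch: for y = A x the Jacobian dy/dx equals A itself
import numpy as np
np.random.seed(0)
A = np.random.randn(3, 4)      # arbitrary 3 x 4 matrix
x = np.random.randn(4)         # arbitrary vector
eps = 1e-6
J = np.zeros((3, 4))
for j in range(4):
    dx = np.zeros(4)
    dx[j] = eps
    # central finite difference for column j of the Jacobian
    J[:, j] = (A @ (x + dx) - A @ (x - dx)) / (2 * eps)
print(np.allclose(J, A))       # True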
Example 2#
We define a scalar (our cost/loss functions are in general also scalars, just think of the mean squared error) as the result of some matrix vector multiplications
with
which means that (using our previous example and keeping track of our definition of the derivative of a scalar) we have
Note that the resulting vector elements are the same for
Since
Example 3#
We start with a new scalar but where now the vector
with
We write out the specific sums involved in the calculation of
taking the derivative of
for
If the matrix
Example 4#
We let the scalar
where both
and the partial derivative
for
and if
The mean squared error and its derivative#
We defined earlier a possible cost function using the mean squared error
or using the design/feature matrix
We note that the design matrix
The mean squared error is a scalar and if we use the results from example three above, we can define a new vector
which depends on
with partial derivative
and using that
where we used the result from example two above. Inserting the last expression we obtain
or as
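As a quick numerical sketch of this result (here the gradient is taken as -(2/n) X^T (y - X beta), consistent with the mean squared error defined above; the data below are arbitrary), we can compare the analytical expression with finite differences:

# Sketch: compare the analytical MSE gradient with finite differences
import numpy as np
np.random.seed(1)
n, p = 50, 3
X = np.random.randn(n, p)
y = np.random.randn(n)
beta = np.random.randn(p)
def C(beta):
    return np.sum((y - X @ beta)**2) / n
grad_analytic = -2.0 / n * X.T @ (y - X @ beta)
eps = 1e-6
grad_numeric = np.zeros(p)
for j in range(p):
    db = np.zeros(p)
    db[j] = eps
    grad_numeric[j] = (C(beta + db) - C(beta - db)) / (2 * eps)
print(np.allclose(grad_analytic, grad_numeric))   # True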
Other useful relations#
We list here some other useful relations we may encounter (recall that vectors are written as boldfaced lower-case letters)
Meet the Hessian Matrix#
A very important matrix we will meet again and again in machine
learning is the Hessian. It is given by the second derivative of the
cost function with respect to the parameters
The Hessian matrix plays an important role and is defined here as
For ordinary least squares, it is inversely proportional (derivation
next week) with the variance of the optimal parameters
Linear algebra question: Can we use the Hessian matrix to say something about properties of the cost function (our optimization problem)? (Hint: think about convex or concave problems and how to relate these to a matrix!)
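To connect with this question, here is a small sketch (with an arbitrary random design matrix, and writing the OLS Hessian as proportional to X^T X, here including the factor 2/n from the mean squared error): the eigenvalues of the Hessian are non-negative, which hints at convexity of the OLS cost function.

# Sketch: the OLS Hessian (2/n) X^T X is positive semi-definite, hence the cost is convex
import numpy as np
np.random.seed(2)
n, p = 100, 5
X = np.random.randn(n, p)
H = 2.0 / n * X.T @ X
eigvals = np.linalg.eigvalsh(H)     # eigenvalues of the symmetric Hessian
print(eigvals)
print(np.all(eigvals >= -1e-12))    # True: all eigenvalues are non-negative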
Interpretations and optimizing our parameters#
The residuals
and with
we have
meaning that the solution for
Example relevant for the exercises#
In order to understand the relation among the predictors
we have five predictors/features. The first is the intercept
Own code for Ordinary Least Squares#
It is rather straightforward to implement the matrix inversion and obtain the parameters
# matrix inversion to find beta
# First we set up the data
import numpy as np
x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)
# and then the design matrix X including the intercept
# The design matrix now as function of a fourth-order polynomial
X = np.zeros((len(x),5))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x**2
X[:,3] = x**3
X[:,4] = x**4
beta = (np.linalg.inv(X.T @ X) @ X.T ) @ y
# and then make the prediction
ytilde = X @ beta
Alternatively, you can use the least squares functionality in Numpy as
fit = np.linalg.lstsq(X, y, rcond =None)[0]
ytildenp = np.dot(fit,X.T)
Adding error analysis and training set up#
We can easily test our fit by computing the
def R2(y_data, y_model):
return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
and we would be using it as
print(R2(y,ytilde))
0.9954482269282812
We can easily add our MSE score as
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
print(MSE(y,ytilde))
0.008936538168866639
and finally the relative error as
def RelativeError(y_data,y_model):
return abs((y_data-y_model)/y_data)
print(RelativeError(y, ytilde))
[0.04030775 0.05734435 0.01740914 0.02044029 0.02486199 0.03573222
0.00752453 0.0573522 0.00102185 0.04459142 0.03221764 0.00342095
0.01898739 0.03160225 0.00450564 0.01979508 0.00152143 0.00857666
0.00762566 0.00703236 0.00490815 0.02171406 0.04265231 0.03063862
0.0273619 0.00668013 0.03013331 0.00788426 0.01275493 0.01698978
0.02715714 0.0164838 0.00424614 0.06575617 0.03937194 0.0459393
0.03321665 0.00857944 0.01458459 0.03839052 0.04106931 0.01962178
0.01758251 0.06598382 0.01262347 0.05373197 0.00508233 0.02902908
0.01284702 0.02266783 0.05481819 0.02857842 0.00817663 0.00478164
0.02844986 0.03430817 0.00845377 0.01544721 0.00867529 0.01881047
0.02900428 0.00525951 0.00516358 0.05597496 0.03021621 0.00458845
0.0134627 0.00244413 0.00194759 0.04527404 0.02602746 0.00545794
0.00747748 0.01109822 0.08180185 0.02539315 0.02321685 0.02859203
0.01136476 0.00341608 0.00341404 0.00651846 0.05339963 0.01042549
0.02637872 0.00639212 0.02177328 0.02384275 0.00133169 0.00699813
0.0205469 0.04806452 0.01573671 0.05430636 0.01207681 0.0160536
0.01157049 0.08333949 0.01405686 0.00212278]
Splitting our Data in Training and Test data#
It is normal in essentially all Machine Learning studies to split the
data in a training set and a test set (sometimes also an additional
validation set). Scikit-Learn has an own function for this. There
is no explicit recipe for how much data should be included as training
data and how much as test data. An accepted rule of thumb is to use
approximately
The complete code with a simple data set#
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
def R2(y_data, y_model):
return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
x = np.random.rand(100)
y = 2.0+5*x*x+0.1*np.random.randn(100)
# The design matrix now as function of a fourth-order polynomial
X = np.zeros((len(x),5))
X[:,0] = 1.0
X[:,1] = x
X[:,2] = x**2
X[:,3] = x**3
X[:,4] = x**4
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# matrix inversion to find beta
beta = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
print(beta)
# and then make the prediction
ytilde = X_train @ beta
print("Training R2")
print(R2(y_train,ytilde))
print("Training MSE")
print(MSE(y_train,ytilde))
ypredict = X_test @ beta
print("Test R2")
print(R2(y_test,ypredict))
print("Test MSE")
print(MSE(y_test,ypredict))
[ 2.04491327 -0.39593882 6.26462211 -1.4899045 0.59718501]
Training R2
0.9973808396916977
Training MSE
0.005799563292953798
Test R2
0.9950846163116835
Test MSE
0.011719944818178452
Making your own test-train splitting#
# equivalently in numpy; note that the inputs and labels must be shuffled
# with the same permutation so that their correspondence is preserved
def train_test_split_numpy(inputs, labels, train_size, test_size):
    n_inputs = len(inputs)
    # shuffle a common set of indices instead of shuffling inputs and labels independently
    indices = np.arange(n_inputs)
    np.random.shuffle(indices)
    inputs_shuffled = inputs[indices]
    labels_shuffled = labels[indices]
    train_end = int(n_inputs*train_size)
    X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]
    Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]
    return X_train, X_test, Y_train, Y_test
But since scikit-learn has its own function for doing this and since it interfaces easily with tensorflow and other libraries, we normally recommend using the latter functionality.
Reducing the number of degrees of freedom, overarching view#
Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality. Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one.
Later we will discuss some of the most popular dimensionality reduction techniques: the principal component analysis (PCA), Kernel PCA, and Locally Linear Embedding (LLE).
Principal component analysis and its various variants deal with the problem of fitting a low-dimensional affine subspace to a set of of data points in a high-dimensional space. With its family of methods it is one of the most used tools in data modeling, compression and visualization.
Preprocessing our data#
Before we proceed, however, we will discuss how to preprocess our data. So far, in connection with our previous examples, we have not met many cases that are particularly sensitive to the scaling of the data. Normally, the data may need a rescaling and/or may be sensitive to extreme values. Scaling the data renders our inputs much more suitable for the algorithms we want to employ.
For data sets gathered for real world applications, it is rather normal that
different features have very different units and
numerical scales. For example, a data set detailing health habits may include
features such as age in the range
Functionality in Scikit-Learn#
Scikit-Learn has several functions which allow us to rescale the
data, normally resulting in much better results in terms of various
accuracy scores. The StandardScaler function in Scikit-Learn
ensures that for each feature/predictor we study the mean value is
zero and the variance is one (every column in the design/feature
matrix). This scaling has the drawback that it does not ensure that
we have a particular maximum or minimum in our data set. Another
function included in Scikit-Learn is the MinMaxScaler which
ensures that all features are exactly between
More preprocessing#
The Normalizer scales each data point such that the feature vector has a Euclidean length of one. In other words, it projects a data point onto the circle (or sphere in the case of higher dimensions) with a radius of 1. This means every data point is scaled by a different number (the inverse of its length). This normalization is often used when only the direction (or angle) of the data matters, not the length of the feature vector.
The RobustScaler works similarly to the StandardScaler in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler uses the median and quartiles, instead of mean and variance. This makes the RobustScaler ignore data points that are very different from the rest (like measurement errors). These odd data points are also called outliers, and might often lead to trouble for other scaling techniques.
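As a brief sketch of these two scalers (the random matrix below is just an illustration), we can verify that the Normalizer gives unit-length data points and that the RobustScaler centers each feature on its median:

# Sketch: Normalizer (unit-length rows) and RobustScaler (median/IQR based)
import numpy as np
from sklearn.preprocessing import Normalizer, RobustScaler
np.random.seed(3)
X = np.random.randn(5, 3)
X_norm = Normalizer().fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))   # each data point now has length one
X_robust = RobustScaler().fit_transform(X)
print(np.median(X_robust, axis=0))      # the median of each feature is now zero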
Frequently used scaling functions#
Many features are often scaled using standardization to improve performance. In Scikit-Learn this is given by the StandardScaler function as discussed above. It is easy however to write your own. Mathematically, this involves subtracting the mean and dividing by the standard deviation over the data set, for each feature:
where
Example of own Standard scaling#
Let us consider the following vanilla example where we use both Scikit-Learn and write our own function as well. We produce a simple test design matrix with random numbers. Each column could then represent a specific feature whose mean value is subtracted.
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
import numpy as np
import pandas as pd
from IPython.display import display
np.random.seed(100)
# setting up a 10 x 5 matrix
rows = 10
cols = 5
X = np.random.randn(rows,cols)
XPandas = pd.DataFrame(X)
display(XPandas)
print(XPandas.mean())
print(XPandas.std())
XPandas = (XPandas -XPandas.mean())
display(XPandas)
# This option does not include the standard deviation
scaler = StandardScaler(with_std=False)
scaler.fit(X)
Xscaled = scaler.transform(X)
display(XPandas-Xscaled)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.749765 | 0.342680 | 1.153036 | -0.252436 | 0.981321 |
1 | 0.514219 | 0.221180 | -1.070043 | -0.189496 | 0.255001 |
2 | -0.458027 | 0.435163 | -0.583595 | 0.816847 | 0.672721 |
3 | -0.104411 | -0.531280 | 1.029733 | -0.438136 | -1.118318 |
4 | 1.618982 | 1.541605 | -0.251879 | -0.842436 | 0.184519 |
5 | 0.937082 | 0.731000 | 1.361556 | -0.326238 | 0.055676 |
6 | 0.222400 | -1.443217 | -0.756352 | 0.816454 | 0.750445 |
7 | -0.455947 | 1.189622 | -1.690617 | -1.356399 | -1.232435 |
8 | -0.544439 | -0.668172 | 0.007315 | -0.612939 | 1.299748 |
9 | -1.733096 | -0.983310 | 0.357508 | -1.613579 | 1.470714 |
0 -0.175300
1 0.083527
2 -0.044334
3 -0.399836
4 0.331939
dtype: float64
0 1.069584
1 0.965548
2 1.018232
3 0.793167
4 0.918992
dtype: float64
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | -1.574465 | 0.259153 | 1.197370 | 0.147400 | 0.649382 |
1 | 0.689519 | 0.137652 | -1.025709 | 0.210340 | -0.076938 |
2 | -0.282727 | 0.351636 | -0.539261 | 1.216683 | 0.340782 |
3 | 0.070889 | -0.614808 | 1.074067 | -0.038300 | -1.450257 |
4 | 1.794282 | 1.458078 | -0.207545 | -0.442600 | -0.147420 |
5 | 1.112383 | 0.647473 | 1.405890 | 0.073598 | -0.276263 |
6 | 0.397700 | -1.526744 | -0.712018 | 1.216290 | 0.418506 |
7 | -0.280647 | 1.106095 | -1.646283 | -0.956563 | -1.564374 |
8 | -0.369139 | -0.751699 | 0.051649 | -0.213103 | 0.967809 |
9 | -1.557795 | -1.066837 | 0.401842 | -1.213743 | 1.138775 |
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Small exercise: perform the standard scaling by including the standard deviation and compare with what Scikit-Learn gives.
Min-Max Scaling#
Another commonly used scaling method is min-max scaling. This is very
useful when we want the features to lie in a certain interval. To
scale the feature
where
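A small sketch of min-max scaling, done both by hand (scaling each feature as (x - min)/(max - min), which maps it onto the unit interval) and with Scikit-Learn's MinMaxScaler (the random data are just an illustration):

# Sketch: min-max scaling by hand and with Scikit-Learn's MinMaxScaler
import numpy as np
from sklearn.preprocessing import MinMaxScaler
np.random.seed(4)
X = 10 * np.random.randn(8, 3) + 5
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_sklearn = MinMaxScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))   # True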
Testing the Mean Squared Error as function of Complexity#
One of the aims is to reproduce Figure 2.11 of Hastie et al.
Our data is defined by
np.random.seed()
n = 100
maxdegree = 14
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
where
Write a first code which sets up a design matrix
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
np.random.seed(2018)
n = 50
maxdegree = 5
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
TestError = np.zeros(maxdegree)
TrainError = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
for degree in range(maxdegree):
model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression(fit_intercept=False))
clf = model.fit(x_train_scaled,y_train)
y_fit = clf.predict(x_train_scaled)
y_pred = clf.predict(x_test_scaled)
polydegree[degree] = degree
TestError[degree] = np.mean( np.mean((y_test - y_pred)**2) )
TrainError[degree] = np.mean( np.mean((y_train - y_fit)**2) )
plt.plot(polydegree, TestError, label='Test Error')
plt.plot(polydegree, TrainError, label='Train Error')
plt.legend()
plt.show()

More preprocessing examples, two-dimensional example, the Franke function#
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
# Making meshgrid of datapoints and compute Franke's function
n = 5
N = 1000
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
# split in training and test data
X_train, X_test, y_train, y_test = train_test_split(X,z,test_size=0.2)
clf = skl.LinearRegression().fit(X_train, y_train)
# The mean squared error and R2 score
print("MSE before scaling: {:.2f}".format(mean_squared_error(clf.predict(X_test), y_test)))
print("R2 score before scaling {:.2f}".format(clf.score(X_test,y_test)))
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Feature min values before scaling:\n {}".format(X_train.min(axis=0)))
print("Feature max values before scaling:\n {}".format(X_train.max(axis=0)))
print("Feature min values after scaling:\n {}".format(X_train_scaled.min(axis=0)))
print("Feature max values after scaling:\n {}".format(X_train_scaled.max(axis=0)))
clf = skl.LinearRegression().fit(X_train_scaled, y_train)
print("MSE after scaling: {:.2f}".format(mean_squared_error(clf.predict(X_test_scaled), y_test)))
print("R2 score for scaled data: {:.2f}".format(clf.score(X_test_scaled,y_test)))
MSE before scaling: 0.00
R2 score before scaling 1.00
Feature min values before scaling:
[1.00000000e+00 6.97906022e-03 2.43639284e-03 4.87072815e-05
1.70037324e-05 5.93601008e-06 3.39931051e-07 1.18670072e-07
4.14277718e-08 1.44624525e-08 2.37239927e-09 8.28205578e-10
2.89126914e-10 1.00934327e-10 3.52362157e-11 1.65571174e-11
5.78009660e-12 2.01783414e-12 7.04426744e-13 2.45915671e-13
8.58492636e-14]
Feature max values before scaling:
[1. 0.99970894 0.99978365 0.99941797 0.99949266 0.99956735
0.99912709 0.99920175 0.99927642 0.9993511 0.99883628 0.99891093
0.99898558 0.99906023 0.99913489 0.99854557 0.99862019 0.99869482
0.99876945 0.99884409 0.99891873]
Feature min values after scaling:
[ 0. -1.71761101 -1.75770568 -1.12330033 -1.12591227 -1.12871842
-0.88613493 -0.884669 -0.88323026 -0.88182591 -0.75269037 -0.75050135
-0.74829661 -0.74607851 -0.74384949 -0.6652177 -0.66294408 -0.66064822
-0.65833132 -0.65599456 -0.6536392 ]
Feature max values after scaling:
[0. 1.71737253 1.75576555 2.20916295 2.23971032 2.26995402
2.60543038 2.63162342 2.65743689 2.68286725 2.94273542 2.96631321
2.98947894 3.01222822 3.034557 3.24159785 3.26297455 3.28391978
3.30442964 3.32450054 3.34412923]
MSE after scaling: 0.00
R2 score for scaled data: 1.00
To think about, first part#
When you are comparing your own code with for example Scikit-Learn’s library, there are some technicalities to keep in mind. The examples here demonstrate some of these aspects with potential pitfalls.
The discussion here focuses on the role of the intercept, how we can set up the design matrix, what scaling we should use and other topics which tend to confuse us.
The intercept can be interpreted as the expected value of our
target/output variables when all other predictors are set to zero.
Thus, if we cannot assume that the expected outputs/targets are zero
when all predictors are zero (the columns in the design matrix), it
may be a bad idea to implement a model which penalizes the intercept.
Furthermore, in for example Ridge and Lasso regression (to be discussed in more detail next week), the default solutions
from the library Scikit-Learn (when not shrinking
More thinking#
If our predictors represent different scales, then it is important to
standardize the design matrix
The StandardScaler function in Scikit-Learn does this for us. For the data sets we have been studying in our various examples, the data are in many cases already scaled and there is no need to scale them. As a user of different machine learning algorithms, you should always survey your data and assess critically whether they need to be scaled.
If the data do need scaling and you skip it, the parameters will be penalized unfairly, since their magnitudes depend on the scale of the corresponding predictors.
Suppose as an example that you have an input variable given by the heights of different persons. Human height might be measured in inches, meters or kilometers. If measured in kilometers, a standard linear regression model with this predictor would give a much bigger coefficient than if measured in millimeters. This can clearly lead to problems in evaluating the cost/loss functions.
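We can illustrate this point with a small sketch (the synthetic heights and coefficients below are arbitrary): rescaling the predictor rescales the fitted coefficient inversely.

# Sketch: the fitted slope depends inversely on the unit of the predictor
import numpy as np
np.random.seed(5)
h_m = 1.5 + 0.4 * np.random.rand(100)            # heights in meters
y = 10.0 + 3.0 * h_m + 0.1 * np.random.randn(100)
h_km = h_m / 1000.0                               # the same heights in kilometers
slope_m = np.polyfit(h_m, y, 1)[0]
slope_km = np.polyfit(h_km, y, 1)[0]
print(slope_m, slope_km)                          # slope_km is roughly 1000 times larger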
Still thinking#
Keep in mind that when you transform your data set before training a model, the same transformation needs to be applied to any new data set before making a prediction. If we translate this into Python code, it could be implemented as follows (note that some lines are commented out since the model function has not been defined):
#Model training, we compute the mean value of y and X
y_train_mean = np.mean(y_train)
X_train_mean = np.mean(X_train,axis=0)
X_train = X_train - X_train_mean
y_train = y_train - y_train_mean
# Then we fit our model with the training data
#trained_model = some_model.fit(X_train,y_train)
#Model prediction, we need also to transform our data set used for the prediction.
X_test = X_test - X_train_mean #Use mean from training data
#y_pred = trained_model(X_test)
#y_pred = y_pred + y_train_mean
What does centering (subtracting the mean values) mean mathematically?#
Let us try to understand what this may imply mathematically when we subtract the mean values, also known as zero centering. For simplicity, we will focus on ordinary regression, as done in the above example.
The cost/loss function for regression is
Recall also that we use the squared value since this leads to an increase of the penalty for higher differences between predicted and output/target values.
What we have done is to single out the
for all
Multiplying away the constant
Further Manipulations#
Let us specialize first to the case where we have only two parameters
We obtain then
If we define
and if we define the mean value of the outputs as
we have
In the general case, that is we have more parameters than
Replacing
Wrapping it up#
If we minimize with respect to
where
For Ridge regression we need to add
What does this mean? And why do we insist on all this? Let us look at some examples.
Linear Regression code, Intercept handling first#
This code shows a simple low-order polynomial fit to a data set using the above transformed data, where we consider the role of the intercept first, by either excluding it or including it (code example thanks to Øyvind Sigmundson Schøyen). Here our scaling of the data is done by subtracting the mean values only. Note also that we do not split the data into training and test data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
np.random.seed(2021)
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
def fit_beta(X, y):
return np.linalg.pinv(X.T @ X) @ X.T @ y
true_beta = [2, 0.5, 3.7]
x = np.linspace(0, 1, 11)
y = np.sum(
np.asarray([x ** p * b for p, b in enumerate(true_beta)]), axis=0
) + 0.1 * np.random.normal(size=len(x))
degree = 3
X = np.zeros((len(x), degree))
# Include the intercept in the design matrix
for p in range(degree):
X[:, p] = x ** p
beta = fit_beta(X, y)
# Intercept is included in the design matrix
skl = LinearRegression(fit_intercept=False).fit(X, y)
print(f"True beta: {true_beta}")
print(f"Fitted beta: {beta}")
print(f"Sklearn fitted beta: {skl.coef_}")
ypredictOwn = X @ beta
ypredictSKL = skl.predict(X)
print(f"MSE with intercept column")
print(MSE(y,ypredictOwn))
print(f"MSE with intercept column from SKL")
print(MSE(y,ypredictSKL))
plt.figure()
plt.scatter(x, y, label="Data")
plt.plot(x, X @ beta, label="Fit")
plt.plot(x, skl.predict(X), label="Sklearn (fit_intercept=False)")
# Do not include the intercept in the design matrix
X = np.zeros((len(x), degree - 1))
for p in range(degree - 1):
X[:, p] = x ** (p + 1)
# Intercept is not included in the design matrix
skl = LinearRegression(fit_intercept=True).fit(X, y)
# Use centered values for X and y when computing coefficients
y_offset = np.average(y, axis=0)
X_offset = np.average(X, axis=0)
beta = fit_beta(X - X_offset, y - y_offset)
intercept = np.mean(y_offset - X_offset @ beta)
print(f"Manual intercept: {intercept}")
print(f"Fitted beta (wiothout intercept): {beta}")
print(f"Sklearn intercept: {skl.intercept_}")
print(f"Sklearn fitted beta (without intercept): {skl.coef_}")
ypredictOwn = X @ beta
ypredictSKL = skl.predict(X)
print(f"MSE with Manual intercept")
print(MSE(y,ypredictOwn+intercept))
print(f"MSE with Sklearn intercept")
print(MSE(y,ypredictSKL))
plt.plot(x, X @ beta + intercept, "--", label="Fit (manual intercept)")
plt.plot(x, skl.predict(X), "--", label="Sklearn (fit_intercept=True)")
plt.grid()
plt.legend()
plt.show()
True beta: [2, 0.5, 3.7]
Fitted beta: [2.08376632 0.19569961 3.97898392]
Sklearn fitted beta: [2.08376632 0.19569961 3.97898392]
MSE with intercept column
0.00411363461744314
MSE with intercept column from SKL
0.004113634617443147
Manual intercept: 2.083766322923899
Fitted beta (without intercept): [0.19569961 3.97898392]
Sklearn intercept: 2.0837663229239043
Sklearn fitted beta (without intercept): [0.19569961 3.97898392]
MSE with Manual intercept
0.00411363461744314
MSE with Sklearn intercept
0.004113634617443131

The intercept is the value of our output/target variable
when all our features are zero and our function crosses the
Printing the MSE, we see first that both methods give the same MSE, as
they should. However, when we move to for example Ridge regression (discussed next week),
the way we treat the intercept may give a larger or smaller MSE,
meaning that the MSE can be penalized by the value of the
intercept. Not including the intercept in the fit means that the
regularization term does not include
To remind the reader, the regularization term, with the intercept in Ridge regression is given by
but when we take out the intercept, this equation becomes
For Lasso regression we have
It means that, when scaling the design matrix and the outputs/targets, by subtracting the mean values, we have an optimization problem which is not penalized by the intercept. The MSE value can then be smaller since it focuses only on the remaining quantities. If we however bring back the intercept, we will get an MSE which then contains the intercept. This becomes more important when we discuss Ridge and Lasso regression next week.
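A small sketch of this point (with a synthetic data set, an arbitrary value of the regularization parameter, and using the closed-form Ridge expression discussed further below): if we include a column of ones in the design matrix and penalize all parameters, the intercept is shrunk towards zero, while centering the data and adding the intercept back afterwards leaves it unpenalized.

# Sketch: Ridge with a penalized intercept column versus centering the data
import numpy as np
np.random.seed(6)
n = 100
x = np.random.rand(n)
y = 5.0 + 2.0 * x + 0.1 * np.random.randn(n)
lmbda = 10.0
# Version 1: intercept column included in the design matrix and penalized
X = np.column_stack((np.ones(n), x))
beta_pen = np.linalg.inv(X.T @ X + lmbda * np.eye(2)) @ X.T @ y
# Version 2: center X and y, fit without intercept, add the intercept back
xc = (x - np.mean(x)).reshape(-1, 1)
yc = y - np.mean(y)
beta_c = np.linalg.inv(xc.T @ xc + lmbda * np.eye(1)) @ xc.T @ yc
intercept = np.mean(y) - np.mean(x) * beta_c[0]
print(beta_pen)               # the intercept estimate is shrunk towards zero
print(intercept, beta_c[0])   # intercept recovered without penalization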
Material for lecture Monday, August 26#
Mathematical Interpretation of Ordinary Least Squares#
What is presented here is a mathematical analysis of various regression algorithms (ordinary least squares, Ridge and Lasso Regression). The analysis is based on an important algorithm in linear algebra, the so-called Singular Value Decomposition (SVD).
We have shown that in ordinary least squares the optimal parameters
The hat over
This means that our best model is defined as
We now define a matrix
We can rewrite
The matrix
Residual Error#
We have defined the residual error as
The residual errors are then the projections of
Simple case#
If the matrix
In this case the matrix
and we have the obvious case
This serves also as a useful test of our codes.
The singular value decomposition#
The examples we have looked at so far are cases where we normally can
invert the matrix
As we will also see in the first project, this may however not be the case in general and a standard matrix inversion algorithm based on say LU, QR or Cholesky decomposition may lead to singularities. We will see examples of this below.
There is however a way to circumvent this problem and also gain some insights about the ordinary least squares approach, and later shrinkage methods like Ridge and Lasso regressions.
This is given by the Singular Value Decomposition (SVD) algorithm, perhaps the most powerful algorithm in linear algebra. The SVD provides a numerically stable matrix decomposition that is used in a large swath of applications.
In machine learning it plays a central role in dealing with for example design matrices that may be near singular or singular. Furthermore, as we will see here, the singular values can be related to the covariance matrix (and thereby the correlation matrix) and in turn the variance of a given quantity. It plays also an important role in the principal component analysis where high-dimensional data can be reduced to the statistically relevant features.
Linear Regression Problems#
One of the typical problems we encounter with linear regression, in particular
when the matrix
This means that the matrix may be rank deficient and it is basically impossible
to model the data using linear regression. As an example, consider the matrix
The columns of
Super-collinearity of an
We see easily that
Fixing the singularity#
If our design matrix
has linearly dependent column vectors, we will not be able to compute the inverse
of
A cheap ad hoc approach is simply to add a small diagonal component to the matrix to invert, that is we change
where
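A small sketch of this ad hoc fix (the design matrix below, with one column being an exact multiple of another, and the size of the small parameter are just illustrations):

# Sketch: a small diagonal term makes a singular X^T X invertible
import numpy as np
np.random.seed(7)
x = np.random.rand(10)
X = np.column_stack((np.ones(10), x, 2 * x))   # the third column is twice the second
print(np.linalg.matrix_rank(X.T @ X))          # rank 2 < 3, so X^T X is singular
lmbda = 1e-4
A = X.T @ X + lmbda * np.eye(3)
print(np.linalg.matrix_rank(A))                # rank 3, now invertible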
Basic math of the SVD#
From standard linear algebra we know that a square matrix
and the eigenvalues are given by the diagonal matrix
The matrix
with
Not all square matrices are diagonalizable. A matrix like the one discussed above
is not diagonalizable; it is a so-called defective matrix. It is easy to see that the condition
The SVD, a Fantastic Algorithm#
However, and this is the strength of the SVD algorithm, any general
matrix
As an example, the above defective matrix can be decomposed as
with eigenvalues
The SVD
decomposition (singular values) gives eigenvalues
In the general case, where our design matrix
The columns of
Economy-size SVD#
If we assume that
The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values,
If
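A short sketch showing the difference between the full and the economy-size SVD in NumPy (the matrix is an arbitrary illustration):

# Sketch: full versus economy-size SVD with NumPy
import numpy as np
np.random.seed(8)
X = np.random.randn(6, 3)                        # n = 6 rows, p = 3 columns
U, S, VT = np.linalg.svd(X, full_matrices=True)
print(U.shape, S.shape, VT.shape)                # (6, 6) (3,) (3, 3)
U, S, VT = np.linalg.svd(X, full_matrices=False)
print(U.shape, S.shape, VT.shape)                # (6, 3) (3,) (3, 3)
print(np.allclose(X, U @ np.diag(S) @ VT))       # the economy form reconstructs X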
Codes for the SVD#
import numpy as np
# SVD decomposition and reconstruction
def SVD(A):
    ''' Takes as input a numpy matrix A and returns its reconstruction
    U @ D @ VT based on the singular value decomposition (SVD).
    The SVD is numerically stable, also for (near-)singular matrices.
    '''
U, S, VT = np.linalg.svd(A,full_matrices=True)
print('test U')
print( (np.transpose(U) @ U - U @np.transpose(U)))
print('test VT')
print( (np.transpose(VT) @ VT - VT @np.transpose(VT)))
print(U)
print(S)
print(VT)
D = np.zeros((len(U),len(VT)))
for i in range(0,len(VT)):
D[i,i]=S[i]
return U @ D @ VT
X = np.array([ [1.0,-1.0], [1.0,-1.0]])
#X = np.array([[1, 2], [3, 4], [5, 6]])
print(X)
C = SVD(X)
# Print the difference between the original matrix and the SVD one
print(C-X)
[[ 1. -1.]
[ 1. -1.]]
test U
[[0. 0.]
[0. 0.]]
test VT
[[0. 0.]
[0. 0.]]
[[-0.70710678 -0.70710678]
[-0.70710678 0.70710678]]
[2.00000000e+00 3.35470445e-17]
[[-0.70710678 0.70710678]
[ 0.70710678 0.70710678]]
[[-3.33066907e-16 4.44089210e-16]
[ 0.00000000e+00 2.22044605e-16]]
The matrix
Note about SVD Calculations#
The
As you can see from the code, the
If you wish to include the zero singular values, you will need to resize the matrices and set up a diagonal matrix as done in the above example
Mathematics of the SVD and implications#
Let us take a closer look at the mathematics of the SVD and the various implications for machine learning studies.
Our starting point is our design matrix
We can SVD decompose our matrix as
where
Similarly,
Finally
All values beyond
Example Matrix#
As an example, consider the following
The singular values are
where
contains only the singular values. Note also (and we will use this below) that
which is a
is a
Setting up the Matrix to be inverted#
The matrix that may cause problems for us is
and using the orthogonality of the matrix
We define
We can now insert the result for the matrix
and using our SVD decomposition of
which gives us, using the orthogonality of the matrix
It means that the ordinary least square model (with the optimal
parameters)
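A numerical sketch of this statement (assuming a design matrix with more rows than columns and full column rank; the data are arbitrary): the OLS prediction equals U U^T y, with U taken from the economy-size SVD.

# Sketch: the OLS prediction equals U U^T y (economy-size SVD, full column rank)
import numpy as np
np.random.seed(9)
n, p = 20, 4
X = np.random.randn(n, p)
y = np.random.randn(n)
beta = np.linalg.pinv(X.T @ X) @ X.T @ y
U, S, VT = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X @ beta, U @ U.T @ y))   # True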
Further properties (important for our analyses later)#
Let us study again
If we now multiply from the right with
This means the vectors
Similarly, if we use the SVD decomposition for the matrix
If we now multiply from the right with
This means the vectors
Important note: we have defined our design matrix
In our lectures, the number of columns will always refer to the number of features in our data set, while the number of rows represents the number of data inputs. Note that in other texts you may find the opposite notation. This has consequences for the definition of for example the covariance matrix and its relation to the SVD.
Meet the Covariance Matrix#
Before we move on to a discussion of Ridge and Lasso regression, we want to show an important example of the above.
We have already noted that the matrix
This quantity defines what is called the Hessian matrix (the second derivative of a function we want to optimize).
The Hessian matrix plays an important role and is defined in this course as
The Hessian matrix for ordinary least squares is also proportional to the covariance matrix. This means also that we can use the SVD to find the eigenvalues of the covariance matrix and the Hessian matrix in terms of the singular values. Let us develop these arguments, as they will play an important role in our machine learning studies.
Introducing the Covariance and Correlation functions#
Before we discuss the link between for example Ridge regression and the singular value decomposition, we need to remind ourselves about the definition of the covariance and the correlation function. These are quantities that play a central role in machine learning methods.
Suppose we have defined two vectors
where for example
With this definition and recalling that the variance is defined as
we can rewrite the covariance matrix as
Note: we have used
Covariance and Correlation Matrix#
The covariance can take values of arbitrary magnitude and may thus lead to problems with loss of numerical precision for particularly large values. It is common to scale the covariance matrix by introducing instead the correlation matrix defined via the so-called correlation function
The correlation function is then given by values
In the above example this is the function we constructed using pandas.
Correlation Function and Design/Feature Matrix#
In our derivation of the various regression algorithms like Ordinary Least Squares or Ridge regression
we defined the design/feature matrix
with
with a given vector
With these definitions, we can now rewrite our
and the correlation matrix
Covariance Matrix Examples#
The Numpy function np.cov calculates the covariance elements using
the factor
Note that this assumes you have the features as the rows, and the inputs as columns, that is
which in turn is converted into the
# Importing various packages
import numpy as np
n = 100
x = np.random.normal(size=n)
print(np.mean(x))
y = 4+3*x+np.random.normal(size=n)
print(np.mean(y))
W = np.vstack((x, y))
C = np.cov(W)
print(C)
0.10790125813226321
4.340071371496255
[[ 1.04193203 3.08165104]
[ 3.08165104 10.18383522]]
Correlation Matrix#
The previous example can be converted into the correlation matrix by
simply scaling the matrix elements with the variances. We should also
subtract the mean values for each column. This leads to the following
code which sets up the correlation matrix for the previous example in
a more brute-force way. Here we subtract the mean values for each column of the design matrix, calculate the relevant mean values and variances and then finally set up the
import numpy as np
n = 100
# define two vectors
x = np.random.random(size=n)
y = 4+3*x+np.random.normal(size=n)
#scaling the x and y vectors
x = x - np.mean(x)
y = y - np.mean(y)
variance_x = np.sum(x@x)/n
variance_y = np.sum(y@y)/n
print(variance_x)
print(variance_y)
cov_xy = np.sum(x@y)/n
cov_xx = np.sum(x@x)/n
cov_yy = np.sum(y@y)/n
C = np.zeros((2,2))
C[0,0]= cov_xx/variance_x
C[1,1]= cov_yy/variance_y
C[0,1]= cov_xy/np.sqrt(variance_y*variance_x)
C[1,0]= C[0,1]
print(C)
0.08881497884574564
1.7086067479626619
[[1. 0.66080313]
[0.66080313 1. ]]
We see that the matrix elements along the diagonal are one as they should be and that the matrix is symmetric. Furthermore, diagonalizing this matrix we easily see that it is a positive definite matrix.
The above procedure with numpy can be made more compact if we use pandas.
Correlation Matrix with Pandas#
We show here how we can set up the correlation matrix using pandas, as done in this simple code
import numpy as np
import pandas as pd
n = 10
x = np.random.normal(size=n)
x = x - np.mean(x)
y = 4+3*x+np.random.normal(size=n)
y = y - np.mean(y)
# Note that we transpose the matrix in order to stay with our ordering n x p
X = (np.vstack((x, y))).T
print(X)
Xpd = pd.DataFrame(X)
print(Xpd)
correlation_matrix = Xpd.corr()
print(correlation_matrix)
[[-0.40620066 -2.01265755]
[ 0.01458611 0.37737221]
[-1.0895387 -3.65442354]
[ 0.2338675 1.12044974]
[ 0.4676059 1.54393936]
[-0.65891389 -3.16304863]
[-0.1715252 0.39197698]
[ 0.71142161 2.95511792]
[ 0.39214397 0.13069442]
[ 0.50655336 2.3105791 ]]
0 1
0 -0.406201 -2.012658
1 0.014586 0.377372
2 -1.089539 -3.654424
3 0.233868 1.120450
4 0.467606 1.543939
5 -0.658914 -3.163049
6 -0.171525 0.391977
7 0.711422 2.955118
8 0.392144 0.130694
9 0.506553 2.310579
0 1
0 1.000000 0.952387
1 0.952387 1.000000
We expand this model to the Franke function discussed above.
Correlation Matrix with Pandas and the Franke function#
# Common imports
import numpy as np
import pandas as pd
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
# Making meshgrid of datapoints and compute Franke's function
n = 4
N = 100
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
Xpd = pd.DataFrame(X)
# subtract the mean values and set up the covariance matrix
Xpd = Xpd - Xpd.mean()
covariance_matrix = Xpd.cov()
print(covariance_matrix)
0 1 2 3 4 5 6 7 \
0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.0 0.078974 0.081276 0.075889 0.077168 0.078540 0.065410 0.066474
2 0.0 0.081276 0.084076 0.078336 0.079946 0.081655 0.067707 0.069028
3 0.0 0.075889 0.078336 0.077460 0.078986 0.080616 0.069452 0.070737
4 0.0 0.077168 0.079946 0.078986 0.080764 0.082653 0.070986 0.072476
5 0.0 0.078540 0.081655 0.080616 0.082653 0.084809 0.072621 0.074323
6 0.0 0.065410 0.067707 0.069452 0.070986 0.072621 0.064074 0.065378
7 0.0 0.066474 0.069028 0.070737 0.072476 0.074323 0.065378 0.066854
8 0.0 0.067637 0.070457 0.072132 0.074084 0.076150 0.066787 0.068441
9 0.0 0.068906 0.072000 0.073644 0.075816 0.078110 0.068307 0.070146
10 0.0 0.055734 0.057835 0.060872 0.062337 0.063894 0.057393 0.058645
11 0.0 0.056683 0.058996 0.062016 0.063653 0.065390 0.058552 0.059951
12 0.0 0.057722 0.060254 0.063260 0.065077 0.066999 0.059807 0.061359
13 0.0 0.058854 0.061614 0.064609 0.066612 0.068727 0.061163 0.062874
14 0.0 0.060083 0.063080 0.066066 0.068264 0.070582 0.062624 0.064501
8 9 10 11 12 13 14
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.067637 0.068906 0.055734 0.056683 0.057722 0.058854 0.060083
2 0.070457 0.072000 0.057835 0.058996 0.060254 0.061614 0.063080
3 0.072132 0.073644 0.060872 0.062016 0.063260 0.064609 0.066066
4 0.074084 0.075816 0.062337 0.063653 0.065077 0.066612 0.068264
5 0.076150 0.078110 0.063894 0.065390 0.066999 0.068727 0.070582
6 0.066787 0.068307 0.057393 0.058552 0.059807 0.061163 0.062624
7 0.068441 0.070146 0.058645 0.059951 0.061359 0.062874 0.064501
8 0.070213 0.072111 0.059993 0.061452 0.063019 0.064699 0.066500
9 0.072111 0.074210 0.061443 0.063061 0.064793 0.066647 0.068629
10 0.059993 0.061443 0.052305 0.053417 0.054617 0.055910 0.057300
11 0.061452 0.063061 0.053417 0.054655 0.055987 0.057418 0.058952
12 0.063019 0.064793 0.054617 0.055987 0.057457 0.059031 0.060716
13 0.064699 0.066647 0.055910 0.057418 0.059031 0.060756 0.062599
14 0.066500 0.068629 0.057300 0.058952 0.060716 0.062599 0.064606
We note here that the covariance is zero for the first row and
column since all elements of the first column in the design matrix were set to one
(we are fitting the function in terms of a polynomial of degree
This means that the variance for these elements will be zero and will cause problems when we set up the correlation matrix. We can simply drop these elements and construct a correlation matrix without these elements.
Rewriting the Covariance and/or Correlation Matrix#
We can rewrite the covariance matrix in a more compact form in terms of the design/feature matrix
To see this let us simply look at a design matrix
If we then compute the expectation value (note the
which is just
where we wrote
It is easy to generalize this to a matrix
Linking with the SVD#
We saw earlier that
Since the matrices here have dimension
where the tilde-matrix
meaning we can write
Multiplying from the right with
What does it mean?#
This means the vectors
In other words, each non-zero singular value of
Note that these are also the eigenvectors and eigenvalues of the Hessian matrix. Note also that the Hessian matrix we are discussing here is from a cost function defined by the mean squared error only.
If we now recall the definition of the covariance matrix (not using Bessel’s correction) we have
meaning that every squared non-singular value of
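We can check this numerically with a small sketch (assuming the columns of X have been centered so that the covariance matrix can be written as X^T X/n, without Bessel's correction):

# Sketch: the eigenvalues of X^T X / n are the squared singular values divided by n
import numpy as np
np.random.seed(10)
n, p = 200, 4
X = np.random.randn(n, p)
X = X - np.mean(X, axis=0)                  # center each column
C = X.T @ X / n
U, S, VT = np.linalg.svd(X, full_matrices=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
print(np.allclose(eigvals, S**2 / n))       # True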
And finally #
For
Since the matrices here have dimension
leading to
Multiplying with
It means that the eigenvalues of
Since we will mainly be interested in the correlations among the features
of our data (the columns of
Ridge and LASSO Regression#
Let us remind ourselves about the expression for the standard Mean Squared Error (MSE) which we used to define our cost function and the equations for the ordinary least squares (OLS) method, that is our optimization problem is
or we can state it as
where we have used the definition of a norm-2 vector, that is
By minimizing the above equation with respect to the parameters
which leads to the Ridge regression minimization problem where we
require that
we have a new optimization equation
which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator.
Here we have defined the norm-1 as
Deriving the Ridge Regression Equations#
Using the matrix-vector expression for Ridge regression and dropping the parameter
and
taking the derivatives with respect to
with
with
If we keep the
In many textbooks the
When we compare this with the ordinary least squares result we have
which can lead to singular matrices. However, with the SVD, we can always compute the inverse of the matrix
We see that Ridge regression is nothing but the standard OLS with a
modified diagonal term added to
Using our insights about the SVD of the design matrix
For Ridge regression this becomes
with the vectors
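As a quick numerical sketch of the closed-form Ridge expression above (assuming the cost function without the 1/n prefactor, so that the penalty enters as λ times the identity; this matches Scikit-Learn's convention for Ridge with alpha = λ and no intercept):

# Sketch: analytical Ridge solution versus Scikit-Learn's Ridge (no intercept)
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(11)
n, p = 100, 5
X = np.random.randn(n, p)
y = np.random.randn(n)
lmbda = 0.5
beta_ridge = np.linalg.inv(X.T @ X + lmbda * np.eye(p)) @ X.T @ y
clf = Ridge(alpha=lmbda, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, clf.coef_))   # True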
Interpreting the Ridge results#
Since
Ridge regression finds the coordinates of
For small eigenvalues
More interpretations#
For the sake of simplicity, let us assume that the design matrix is orthonormal, that is
In this case the standard OLS results in
and
that is the Ridge estimator scales the OLS estimator by the inverse of a factor
We will come back to more interpretations after we have gone through some of the statistical analysis part.
For more discussions of Ridge and Lasso regression, Wessel van Wieringen’s article is highly recommended. Similarly, Mehta et al’s article is also recommended.
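Before moving on, we can verify the orthonormal case above numerically with a small sketch (using a random matrix with orthonormal columns obtained from a QR decomposition; the value of λ is arbitrary): the Ridge estimator is the OLS estimator divided by 1 + λ.

# Sketch: with an orthonormal design matrix the Ridge estimator is OLS/(1 + lambda)
import numpy as np
np.random.seed(12)
n, p = 50, 3
Q, _ = np.linalg.qr(np.random.randn(n, p))   # the columns of Q are orthonormal
y = np.random.randn(n)
lmbda = 2.0
beta_ols = Q.T @ y                           # since Q^T Q is the identity
beta_ridge = np.linalg.inv(Q.T @ Q + lmbda * np.eye(p)) @ Q.T @ y
print(np.allclose(beta_ridge, beta_ols / (1.0 + lmbda)))   # True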
Deriving the Lasso Regression Equations#
Using the matrix-vector expression for Lasso regression, we have the following cost function
Taking the derivative with respect to
we have that the derivative of the cost function is
and reordering we have
We can redefine
This equation does not lead to a nice analytical equation as in either Ridge regression or ordinary least squares. This equation can however be solved by using standard convex optimization algorithms using for example the Python package CVXOPT. We will discuss this later.
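Since there is no closed-form solution, in practice one relies on iterative solvers. As a small sketch (using Scikit-Learn's Lasso, which employs coordinate descent; the synthetic data and the value of alpha are arbitrary illustrations), we can see the characteristic property of Lasso that some coefficients are set exactly to zero:

# Sketch: Lasso with Scikit-Learn; several coefficients become exactly zero
import numpy as np
from sklearn.linear_model import Lasso
np.random.seed(13)
n, p = 100, 10
X = np.random.randn(n, p)
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]                 # only three relevant features
y = X @ beta_true + 0.1 * np.random.randn(n)
clf = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(clf.coef_)                                  # most coefficients are exactly zero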