Linear Regression and more Advanced Regression Analysis
Contents
10. Linear Regression and more Advanced Regression Analysis¶
10.1. Why Linear Regression (aka Ordinary Least Squares and family)¶
Fitting a continuous function with linear parameterization in terms of the parameters
Method of choice for fitting a continuous function!
Gives an excellent introduction to central Machine Learning features with understandable pedagogical links to other methods like Neural Networks, Support Vector Machines etc
Analytical expression for the fitting parameters
Analytical expressions for statistical properties like mean values, variances, confidence intervals and more
Analytical relation with probabilistic interpretations
Easy to introduce basic concepts like bias-variance tradeoff, cross-validation, resampling and regularization techniques and many other ML topics
Easy to code! And links well with classification problems and logistic regression and neural networks
Allows for easy hands-on understanding of gradient descent methods
and many more features
For more discussions of Ridge and Lasso regression, Wessel van Wieringen’s article is highly recommended. Similarly, Mehta et al’s article is also recommended.
10.2. Regression analysis, overarching aims¶
Regression modeling deals with the description of the sampling distribution of a given random variable
A regression model aims at finding a likelihood function
Response (target, dependent or outcome) variable $y_i$, with $i=0,\ldots,n-1$,
with so-called explanatory (independent or predictor) variables $x_{ij}$, with $i=0,\ldots,n-1$ and the explanatory variables running from $j=0$ to $j=p-1$. See below for more explicit examples.
The goal of the regression analysis is to extract/exploit relationship between
10.3. Regression analysis, overarching aims II¶
Consider an experiment in which
The matrix
Linear regression gives us a set of analytical equations for the parameters
10.4. Examples¶
In order to understand the relation among the predictors
There we assumed that we could parametrize the data using a polynomial approximation based on the liquid drop model. Assuming
we have five predictors, that is the intercept and the terms $A$, $A^{2/3}$, $A^{-1/3}$ and $A^{-1}$.
Here the predictors are based on a model we have made. A popular data set which is widely encountered in ML applications is the
so-called credit card default data from Taiwan. The data set contains data on
10.5. General linear models¶
Before we proceed let us study a case from linear algebra where we aim at fitting a set of data
Since obtaining these data points may not be trivial, we want to use these data to fit a function which can allow us to make predictions for values of
where
10.6. Rewriting the fitting procedure as a linear algebra problem¶
For every set of values
10.7. Rewriting the fitting procedure as a linear algebra problem, more details¶
Defining the vectors
and
and
and the design matrix
we can rewrite our equations as
The above design matrix is called a Vandermonde matrix.
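As a small, self-contained sketch (the data points here are made up for the illustration), such a Vandermonde design matrix can be generated directly with NumPy:
import numpy as np
# Hypothetical data points x_i; any values will do for the illustration.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
n = 2  # polynomial degree
# Vandermonde design matrix with columns 1, x, x^2, ..., x^n
X = np.vander(x, N=n+1, increasing=True)
print(X)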
10.8. Generalizing the fitting procedure as a linear algebra problem¶
We are obviously not limited to the above polynomial expansions. We
could replace the various powers of
Note that we have
10.9. Generalizing the fitting procedure as a linear algebra problem¶
We redefine in turn the matrix
and without loss of generality we rewrite again our equations as
The left-hand side of this equation is known. Our error vector
10.10. Optimizing our parameters¶
We have defined the matrix
As we noted above, we stayed with a system with the design matrix
10.11. Our model for the nuclear binding energies¶
In our introductory notes we looked at the so-called liquid drop model. Let us remind ourselves about what we did by looking at the code.
We restate the parts of the code we are most interested in.
%matplotlib inline
# Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import os
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("MassEval2016.dat"),'r')
# Read the experimental data with Pandas
Masses = pd.read_fwf(infile, usecols=(2,3,4,6,11),
names=('N', 'Z', 'A', 'Element', 'Ebinding'),
widths=(1,3,5,5,5,1,3,4,1,13,11,11,9,1,2,11,9,1,3,1,12,11,1),
header=39,
index_col=False)
# Extrapolated values are indicated by '#' in place of the decimal place, so
# the Ebinding column won't be numeric. Coerce to float and drop these entries.
Masses['Ebinding'] = pd.to_numeric(Masses['Ebinding'], errors='coerce')
Masses = Masses.dropna()
# Convert from keV to MeV.
Masses['Ebinding'] /= 1000
# Group the DataFrame by nucleon number, A.
Masses = Masses.groupby('A')
# Find the rows of the grouped DataFrame with the maximum binding energy.
Masses = Masses.apply(lambda t: t[t.Ebinding==t.Ebinding.max()])
A = Masses['A']
Z = Masses['Z']
N = Masses['N']
Element = Masses['Element']
Energies = Masses['Ebinding']
# Now we set up the design matrix X
X = np.zeros((len(A),5))
X[:,0] = 1
X[:,1] = A
X[:,2] = A**(2.0/3.0)
X[:,3] = A**(-1.0/3.0)
X[:,4] = A**(-1.0)
# Then nice printout using pandas
DesignMatrix = pd.DataFrame(X)
DesignMatrix.index = A
DesignMatrix.columns = ['1', 'A', 'A^(2/3)', 'A^(-1/3)', '1/A']
display(DesignMatrix)
With
throughout these lectures.
10.12. Optimizing our parameters, more details¶
With the above we use the design matrix to define the approximation
and in order to find the optimal parameters
or using the matrix
This function is one possible way to define the so-called cost function.
It is also common to define
the function
since when taking the first derivative with respect to the unknown parameters
10.13. Interpretations and optimizing our parameters¶
The function
can be linked to the variance of the quantity
where
In order to find the parameters
In practical terms it means we will require
which results in
or in a matrix-vector form as
10.14. Interpretations and optimizing our parameters¶
We can rewrite
as
and if the matrix
We note also that since our design matrix is defined as
Small question: Do you think the example we have at hand here (the nuclear binding energies) can lead to problems in inverting the matrix
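As a quick, hedged check (a sketch using the same five liquid-drop predictors as in the binding-energy design matrix above, evaluated over a made-up range of mass numbers), one can inspect the condition number of the matrix to be inverted:
import numpy as np
# The liquid-drop predictors 1, A, A^(2/3), A^(-1/3), 1/A for a range of mass numbers A
A = np.arange(1, 271, dtype=float)
X = np.column_stack((np.ones_like(A), A, A**(2.0/3.0), A**(-1.0/3.0), 1.0/A))
# A very large condition number signals that inverting X^T X may be numerically delicate
print(np.linalg.cond(X.T @ X))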
10.15. Some useful matrix and vector expressions¶
The following matrix and vector relation will be useful here and for the rest of the course. Vectors are always written as boldfaced lower case letters and matrices as upper case boldfaced letters.
$$\frac{\partial (\boldsymbol{b}^T\boldsymbol{a})}{\partial \boldsymbol{a}} = \boldsymbol{b},$$
$$\frac{\partial (\boldsymbol{a}^T\boldsymbol{A}\boldsymbol{a})}{\partial \boldsymbol{a}} = \left(\boldsymbol{A}+\boldsymbol{A}^T\right)\boldsymbol{a},$$
$$\frac{\partial \left(\boldsymbol{x}-\boldsymbol{A}\boldsymbol{s}\right)^T\left(\boldsymbol{x}-\boldsymbol{A}\boldsymbol{s}\right)}{\partial \boldsymbol{s}} = -2\left(\boldsymbol{x}-\boldsymbol{A}\boldsymbol{s}\right)^T\boldsymbol{A}.$$
10.16. Interpretations and optimizing our parameters¶
The residuals
and with
we have
meaning that the solution for
Let us now return to our nuclear binding energies and simply code the above equations.
10.17. Own code for Ordinary Least Squares¶
It is rather straightforward to implement the matrix inversion and obtain the parameters
# matrix inversion to find beta
beta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Energies)
# and then make the prediction
ytilde = X @ beta
Alternatively, you can use the least squares functionality in Numpy as
fit = np.linalg.lstsq(X, Energies, rcond =None)[0]
ytildenp = np.dot(fit,X.T)
And finally we plot our fit and compare with the data
Masses['Eapprox'] = ytilde
# Generate a plot comparing the experimental values with the fitted values.
fig, ax = plt.subplots()
ax.set_xlabel(r'$A = N + Z$')
ax.set_ylabel(r'$E_\mathrm{bind}\,/\mathrm{MeV}$')
ax.plot(Masses['A'], Masses['Ebinding'], alpha=0.7, lw=2,
label='Ame2016')
ax.plot(Masses['A'], Masses['Eapprox'], alpha=0.7, lw=2, c='m',
label='Fit')
ax.legend()
save_fig("Masses2016OLS")
plt.show()
10.18. Adding error analysis and training set up¶
We can easily test our fit by computing the $R^2$ score
def R2(y_data, y_model):
return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
and we would be using it as
print(R2(Energies,ytilde))
We can easily add our MSE score as
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
print(MSE(Energies,ytilde))
and finally the relative error as
def RelativeError(y_data,y_model):
return abs((y_data-y_model)/y_data)
print(RelativeError(Energies, ytilde))
10.19. The $\chi^2$ function¶
Normally, the response (dependent or outcome) variable
Introducing the standard deviation
where the matrix
10.20. The $\chi^2$ function¶
In order to find the parameters
which results in
or in a matrix-vector form as
where we have defined the matrix
10.22. The $\chi^2$ function¶
If we then introduce the matrix
we have then the following expression for the parameters
We state without proof the expression for the uncertainty in the parameters
resulting in
10.23. The $\chi^2$ function¶
The first step here is to approximate the function
By computing the derivatives of
and
10.24. The $\chi^2$ function¶
For a linear fit (a first-order polynomial) we don’t need to invert a matrix!!
Defining
we obtain
This approach (for both linear and non-linear regression) often suffers from being either underdetermined or overdetermined in the unknown coefficients.
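For the straight-line case itself, a minimal sketch (with hypothetical data points and uncertainties) of the closed-form weighted least-squares solution could read as follows:
import numpy as np
# Hypothetical measurements y_i with uncertainties sigma_i at points x_i
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = np.array([0.2, 0.2, 0.3, 0.3, 0.4])
# Weighted sums entering the closed-form solution of y = beta0 + beta1*x
w = 1.0/sigma**2
S, Sx, Sy = np.sum(w), np.sum(w*x), np.sum(w*y)
Sxx, Sxy = np.sum(w*x*x), np.sum(w*x*y)
Delta = S*Sxx - Sx**2
beta1 = (S*Sxy - Sx*Sy)/Delta    # slope
beta0 = (Sxx*Sy - Sx*Sxy)/Delta  # intercept
print(beta0, beta1)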
10.25. Fitting an Equation of State for Dense Nuclear Matter¶
Before we continue, let us introduce yet another example. We are going to fit the nuclear equation of state using results from many-body calculations. The equation of state we have made available here, as a function of density, has been derived using modern nucleon-nucleon potentials with the addition of three-body forces. This time the file is presented as a standard csv file.
The beginning of the Python code here is similar to what you have seen before, with the same initializations and declarations. We also use pandas again, rather extensively, in order to organize our data.
The difference now is that we use Scikit-Learn’s regression tools
instead of our own matrix inversion implementation. Furthermore, we
sneak in Ridge regression (to be discussed below) which includes a hyperparameter $\lambda$.
10.26. The code¶
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("EoS.csv"),'r')
# Read the EoS data as csv file and organize the data into two arrays with density and energies
EoS = pd.read_csv(infile, names=('Density', 'Energy'))
EoS['Energy'] = pd.to_numeric(EoS['Energy'], errors='coerce')
EoS = EoS.dropna()
Energies = EoS['Energy']
Density = EoS['Density']
# The design matrix now as function of various polytrops
X = np.zeros((len(Density),4))
X[:,3] = Density**(4.0/3.0)
X[:,2] = Density
X[:,1] = Density**(2.0/3.0)
X[:,0] = 1
# We use now Scikit-Learn's linear regressor and ridge regressor
# OLS part
clf = skl.LinearRegression().fit(X, Energies)
ytilde = clf.predict(X)
EoS['Eols'] = ytilde
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Energies, ytilde))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Energies, ytilde))
# Mean absolute error
print('Mean absolute error: %.2f' % mean_absolute_error(Energies, ytilde))
print(clf.coef_, clf.intercept_)
# The Ridge regression with a hyperparameter lambda = 0.1
_lambda = 0.1
clf_ridge = skl.Ridge(alpha=_lambda).fit(X, Energies)
yridge = clf_ridge.predict(X)
EoS['Eridge'] = yridge
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Energies, yridge))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(Energies, yridge))
# Mean absolute error
print('Mean absolute error: %.2f' % mean_absolute_error(Energies, yridge))
print(clf_ridge.coef_, clf_ridge.intercept_)
fig, ax = plt.subplots()
ax.set_xlabel(r'$\rho[\mathrm{fm}^{-3}]$')
ax.set_ylabel(r'Energy per particle')
ax.plot(EoS['Density'], EoS['Energy'], alpha=0.7, lw=2,
label='Theoretical data')
ax.plot(EoS['Density'], EoS['Eols'], alpha=0.7, lw=2, c='m',
label='OLS')
ax.plot(EoS['Density'], EoS['Eridge'], alpha=0.7, lw=2, c='g',
label=r'Ridge $\lambda = 0.1$')
ax.legend()
save_fig("EoSfitting")
plt.show()
The above simple polynomial in density
We note also that there is a small deviation between the standard OLS and the Ridge regression at higher densities. We discuss this in more detail below.
10.27. Splitting our Data in Training and Test data¶
It is normal in essentially all Machine Learning studies to split the
data in a training set and a test set (sometimes also an additional
validation set). Scikit-Learn has an own function for this. There
is no explicit recipe for how much data should be included as training
data and say test data. An accepted rule of thumb is to use
approximately $2/3$ to $4/5$ of the data as training data.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
def R2(y_data, y_model):
return 1 - np.sum((y_data - y_model) ** 2) / np.sum((y_data - np.mean(y_data)) ** 2)
def MSE(y_data,y_model):
n = np.size(y_model)
return np.sum((y_data-y_model)**2)/n
infile = open(data_path("EoS.csv"),'r')
# Read the EoS data as csv file and organized into two arrays with density and energies
EoS = pd.read_csv(infile, names=('Density', 'Energy'))
EoS['Energy'] = pd.to_numeric(EoS['Energy'], errors='coerce')
EoS = EoS.dropna()
Energies = EoS['Energy']
Density = EoS['Density']
# The design matrix now as function of various polytrops
X = np.zeros((len(Density),5))
X[:,0] = 1
X[:,1] = Density**(2.0/3.0)
X[:,2] = Density
X[:,3] = Density**(4.0/3.0)
X[:,4] = Density**(5.0/3.0)
# We split the data in test and training data
X_train, X_test, y_train, y_test = train_test_split(X, Energies, test_size=0.2)
# matrix inversion to find beta
beta = np.linalg.inv(X_train.T.dot(X_train)).dot(X_train.T).dot(y_train)
# and then make the prediction
ytilde = X_train @ beta
print("Training R2")
print(R2(y_train,ytilde))
print("Training MSE")
print(MSE(y_train,ytilde))
ypredict = X_test @ beta
print("Test R2")
print(R2(y_test,ypredict))
print("Test MSE")
print(MSE(y_test,ypredict))
10.28. The Boston housing data example¶
The Boston housing
data set was originally a part of UCI Machine Learning Repository
and has been removed now. The data set is now included in Scikit-Learn’s
library. There are 506 samples and 13 feature (predictor) variables
in this data set. The objective is to predict the value of prices of
the house using the features (predictors) listed here.
The features/predictors are
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25000 square feet
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per USD10000
B: $1000(Bk - 0.63)^2$, where $Bk$ is the proportion of people of African American descent by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in USD 1000s
10.29. Housing data, the code¶
We start by importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
and load the Boston Housing DataSet from Scikit-Learn
from sklearn.datasets import load_boston
# Note: load_boston is deprecated and was removed in scikit-learn 1.2;
# this example requires an older scikit-learn version.
boston_dataset = load_boston()
# boston_dataset is a dictionary
# let's check what it contains
boston_dataset.keys()
Then we invoke Pandas
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()
boston['MEDV'] = boston_dataset.target
and preprocess the data
# check for missing values in all the columns
boston.isnull().sum()
We can then visualize the data
# set the size of the figure
sns.set(rc={'figure.figsize':(11.7,8.27)})
# plot a histogram showing the distribution of the target values
sns.histplot(boston['MEDV'], bins=30, kde=True)
plt.show()
It is now useful to look at the correlation matrix
# compute the pair wise correlation for all columns
correlation_matrix = boston.corr().round(2)
# use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)
From the above correlation plot we can see that MEDV is strongly correlated with LSTAT and RM. We also see that RAD and TAX are strongly correlated, and we do not include both of them among our features, in order to avoid multicollinearity
plt.figure(figsize=(20, 5))
features = ['LSTAT', 'RM']
target = boston['MEDV']
for i, col in enumerate(features):
plt.subplot(1, len(features) , i+1)
x = boston[col]
y = target
plt.scatter(x, y, marker='o')
plt.title(col)
plt.xlabel(col)
plt.ylabel('MEDV')
Now we start training our model
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns = ['LSTAT','RM'])
Y = boston['MEDV']
We split the data into training and test sets
from sklearn.model_selection import train_test_split
# splits the training and test data set in 80% : 20%
# assign random_state to any fixed value. This ensures consistency.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
Then we use the linear regression functionality from Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)
print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")
# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
# root mean square error of the model
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
# r-squared score of the model
r2 = r2_score(Y_test, y_test_predict)
print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
# plotting the y_test vs y_pred
# ideally should have been a straight line
plt.scatter(Y_test, y_test_predict)
plt.show()
10.30. Reducing the number of degrees of freedom, overarching view¶
Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality. Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one.
Later we will discuss some of the most popular dimensionality reduction techniques: the principal component analysis (PCA), Kernel PCA, and Locally Linear Embedding (LLE).
Principal component analysis and its various variants deal with the problem of fitting a low-dimensional affine subspace to a set of of data points in a high-dimensional space. With its family of methods it is one of the most used tools in data modeling, compression and visualization.
10.31. Preprocessing our data¶
Before we proceed, however, we will discuss how to preprocess our data. Until now, and in connection with our previous examples, we have not encountered many cases that are particularly sensitive to the scaling of the data. Normally the data may need rescaling and/or may be sensitive to extreme values. Scaling the data renders our inputs much more suitable for the algorithms we want to employ.
Scikit-Learn has several functions which allow us to rescale the
data, normally resulting in much better results in terms of various
accuracy scores. The StandardScaler function in Scikit-Learn
ensures that for each feature/predictor we study the mean value is
zero and the variance is one (every column in the design/feature
matrix). This scaling has the drawback that it does not ensure that
we have a particular maximum or minimum in our data set. Another
function included in Scikit-Learn is the MinMaxScaler which
ensures that all features are scaled to lie between $0$ and $1$.
10.32. More preprocessing¶
The Normalizer scales each data point such that the feature vector has a Euclidean length of one. In other words, it projects a data point onto the circle (or sphere in the case of higher dimensions) with a radius of 1. This means every data point is scaled by a different number (by the inverse of its length). This normalization is often used when only the direction (or angle) of the data matters, not the length of the feature vector.
The RobustScaler works similarly to the StandardScaler in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler uses the median and quartiles, instead of mean and variance. This makes the RobustScaler ignore data points that are very different from the rest (like measurement errors). These odd data points are also called outliers, and might often lead to trouble for other scaling techniques.
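A small sketch (with made-up numbers, including one obvious outlier) showing how these Scikit-Learn scalers behave on the same data:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer
# Small synthetic feature matrix; the last row contains an outlier in the second column
X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 300.0]])
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler(), Normalizer()):
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(X))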
10.33. Simple preprocessing examples, Franke function and regression¶
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as skl
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
# Making meshgrid of datapoints and compute Franke's function
n = 5
N = 1000
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
# split in training and test data
X_train, X_test, y_train, y_test = train_test_split(X,z,test_size=0.2)
clf = skl.LinearRegression().fit(X_train, y_train)
# The mean squared error and R2 score
print("MSE before scaling: {:.2f}".format(mean_squared_error(clf.predict(X_test), y_test)))
print("R2 score before scaling {:.2f}".format(clf.score(X_test,y_test)))
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Feature min values before scaling:\n {}".format(X_train.min(axis=0)))
print("Feature max values before scaling:\n {}".format(X_train.max(axis=0)))
print("Feature min values after scaling:\n {}".format(X_train_scaled.min(axis=0)))
print("Feature max values after scaling:\n {}".format(X_train_scaled.max(axis=0)))
clf = skl.LinearRegression().fit(X_train_scaled, y_train)
print("MSE after scaling: {:.2f}".format(mean_squared_error(clf.predict(X_test_scaled), y_test)))
print("R2 score for scaled data: {:.2f}".format(clf.score(X_test_scaled,y_test)))
10.34. The singular value decomposition¶
The examples we have looked at so far are cases where we normally can
invert the matrix
This may however not be the case in general and a standard matrix inversion algorithm based on say LU, QR or Cholesky decomposition may lead to singularities. We will see examples of this below.
There is however a way to partially circumvent this problem and also gain some insights about the ordinary least squares approach, and later shrinkage methods like Ridge and Lasso regressions.
This is given by the Singular Value Decomposition algorithm, perhaps the most powerful linear algebra algorithm. Let us look at a different example where we may have problems with the standard matrix inversion algorithm. Thereafter we dive into the math of the SVD.
10.35. Linear Regression Problems¶
One of the typical problems we encounter with linear regression, in particular
when the matrix
This means that the matrix may be rank deficient and it is basically impossible to
to model the data using linear regression. As an example, consider the matrix
The columns of
Super-collinearity of an
We see easily that
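To make the rank-deficiency problem concrete, here is a minimal sketch (the matrix below is made up; its third column is the sum of the first two):
import numpy as np
# Design matrix with linearly dependent columns: column 3 = column 1 + column 2
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 3.0],
              [3.0, 0.0, 3.0],
              [4.0, 4.0, 8.0]])
print(np.linalg.matrix_rank(X))   # rank 2, smaller than the number of columns
print(np.linalg.cond(X.T @ X))    # enormous condition number
# np.linalg.inv(X.T @ X) is then ill-defined; the Moore-Penrose pseudoinverse still exists
print(np.linalg.pinv(X.T @ X))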
10.36. Fixing the singularity¶
If our design matrix
has linearly dependent column vectors, we will not be able to compute the inverse
of
A cheap ad hoc approach is simply to add a small diagonal component to the matrix to invert, that is we change
where
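A minimal sketch of this ad hoc fix, reusing the made-up rank-deficient matrix from above and an arbitrary small diagonal term:
import numpy as np
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 3.0],
              [3.0, 0.0, 3.0],
              [4.0, 4.0, 8.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
lam = 1e-4  # small, ad hoc diagonal term
XtX = X.T @ X
beta = np.linalg.inv(XtX + lam*np.eye(XtX.shape[0])) @ X.T @ y
print(beta)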
10.37. Basic math of the SVD¶
From standard linear algebra we know that a square matrix
and the eigenvalues are given by the diagonal matrix
The matrix
with
Not all square matrices are diagonalizable. A matrix like the one discussed above
is not diagonalizable; it is a so-called defective matrix. It is easy to see that the condition
10.38. The SVD, a Fantastic Algorithm¶
However, and this is the strength of the SVD algorithm, any general
matrix
As an example, the above defective matrix can be decomposed as
with eigenvalues
The SVD
decomposition (singular values) gives eigenvalues
In the general case, where our design matrix
The columns of
10.39. Economy-size SVD¶
If we assume that
The economy-size decomposition removes extra rows or columns of zeros
from the diagonal matrix of singular values,
If
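With NumPy the economy-size decomposition is obtained by setting full_matrices=False; a small sketch with a random tall matrix:
import numpy as np
X = np.random.randn(6, 3)          # a tall matrix with more rows than columns
U, s, VT = np.linalg.svd(X)        # full SVD: U is 6x6
print(U.shape, s.shape, VT.shape)
U, s, VT = np.linalg.svd(X, full_matrices=False)  # economy-size: U is 6x3
print(U.shape, s.shape, VT.shape)
print(np.allclose(X, U @ np.diag(s) @ VT))        # the factors still reproduce X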
10.40. Codes for the SVD¶
import numpy as np
# SVD inversion
def SVDinv(A):
''' Takes as input a numpy matrix A and returns inv(A) based on singular value decomposition (SVD).
SVD is numerically more stable than the inversion algorithms provided by
numpy and scipy.linalg at the cost of being slower.
'''
U, s, VT = np.linalg.svd(A)
# print('test U')
# print( (np.transpose(U) @ U - U @np.transpose(U)))
# print('test VT')
# print( (np.transpose(VT) @ VT - VT @np.transpose(VT)))
print(U)
print(s)
print(VT)
D = np.zeros((len(U),len(VT)))
for i in range(0,len(VT)):
D[i,i]=s[i]
UT = np.transpose(U); V = np.transpose(VT); invD = np.linalg.inv(D)
return np.matmul(V,np.matmul(invD,UT))
X = np.array([ [1.0, -1.0, 2.0], [1.0, 0.0, 1.0], [1.0, 2.0, -1.0], [1.0, 1.0, 0.0] ])
print(X)
A = np.transpose(X) @ X
print(A)
# Brute force inversion of super-collinear matrix
#B = np.linalg.inv(A)
#print(B)
C = SVDinv(A)
print(C)
The matrix
10.41. Mathematical Properties¶
There are several interesting mathematical properties which will be relevant when we are going to discuss the differences between say ordinary least squares (OLS) and Ridge regression.
We have from OLS that the parameters of the linear approximation are given by
The matrix to invert can be rewritten in terms of our SVD decomposition as
Using the orthogonality properties of
with
This means that
that is the eigenvectors of
that is, the eigenvectors of
Going back to our OLS equation we have
We will come back to this expression when we discuss Ridge regression.
It is indeed the economy-size SVD; note that the summation runs up to $p-1$ and not $n-1$.
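As a quick numerical check (random data, a sketch only), the OLS prediction obtained from the normal equations coincides with the projection built from the economy-size left singular vectors, i.e. with $\boldsymbol{U}\boldsymbol{U}^T\boldsymbol{y}$:
import numpy as np
np.random.seed(42)
X = np.random.randn(10, 3)
y = np.random.randn(10)
# OLS prediction from the normal equations
ytilde_ols = X @ np.linalg.inv(X.T @ X) @ X.T @ y
# Same prediction from the economy-size left singular vectors
U, s, VT = np.linalg.svd(X, full_matrices=False)
print(np.allclose(ytilde_ols, U @ U.T @ y))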
10.42. Ridge and LASSO Regression¶
Let us remind ourselves about the expression for the standard Mean Squared Error (MSE) which we used to define our cost function and the equations for the ordinary least squares (OLS) method, that is our optimization problem is
or we can state it as
where we have used the definition of a norm-2 vector, that is
By minimizing the above equation with respect to the parameters
which leads to the Ridge regression minimization problem where we
require that
we have a new optimization equation
which leads to Lasso regression. Lasso stands for least absolute shrinkage and selection operator.
Here we have defined the norm-1 as
10.43. More on Ridge Regression¶
Using the matrix-vector expression for Ridge regression,
by taking the derivatives with respect to
with
with
We see that Ridge regression is nothing but the standard
OLS with a modified diagonal term added to
Furthermore, if we use the result above in terms of the SVD decomposition (our analysis was done for the OLS method), we had
We can analyse the OLS solutions in terms of the eigenvectors (the columns) of the right singular value matrix
For Ridge regression this becomes
with the vectors
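A minimal sketch (synthetic data) comparing the closed-form Ridge expression with Scikit-Learn's Ridge; fit_intercept=False is used so that both solve exactly the same matrix problem:
import numpy as np
from sklearn.linear_model import Ridge
np.random.seed(1)
X = np.random.randn(50, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1*np.random.randn(50)
lam = 0.1
# Closed-form Ridge solution (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.inv(X.T @ X + lam*np.eye(X.shape[1])) @ X.T @ y
clf = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(beta_ridge)
print(clf.coef_)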
10.44. Interpreting the Ridge results¶
Since
Ridge regression finds the coordinates of
For small eigenvalues
10.45. More interpretations¶
For the sake of simplicity, let us assume that the design matrix is orthonormal, that is
In this case the standard OLS results in
and
that is, the Ridge estimator scales the OLS estimator by the inverse of a factor $1+\lambda$.
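A small numerical sketch of this shrinkage: build an orthonormal design matrix via a QR decomposition (so that $X^TX$ is the identity) and compare the two estimators:
import numpy as np
np.random.seed(0)
Q, _ = np.linalg.qr(np.random.randn(20, 4))  # orthonormal columns, Q^T Q = I
X = Q
y = np.random.randn(20)
lam = 0.5
beta_ols = X.T @ y                            # OLS solution when X^T X = I
beta_ridge = np.linalg.inv(X.T @ X + lam*np.eye(4)) @ X.T @ y
print(np.allclose(beta_ridge, beta_ols/(1.0 + lam)))  # Ridge = OLS/(1+lambda)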
We will come back to more interpretations after we have gone through some of the statistical analysis part.
For more discussions of Ridge and Lasso regression, Wessel van Wieringen’s article is highly recommended. Similarly, Mehta et al’s article is also recommended.
10.46. A better understanding of regularization¶
The parameter
Here we will first look at how to analyze the difference between the
standard OLS equations and the Ridge expressions in terms of a linear
algebra analysis using the SVD algorithm. Thereafter, we will link
(see the material on the bias-variance tradeoff below) these
observation to the statisical analysis of the results. In particular
we consider how the variance of the parameters
10.47. Decomposing the OLS and Ridge expressions¶
We have our design matrix
with
The matrices
10.48. Introducing the Covariance and Correlation functions¶
Before we discuss the link between for example Ridge regression and the singular value decomposition, we need to remind ourselves about the definition of the covariance and the correlation function. These are quantities
Suppose we have defined two vectors
where for example
With this definition and recalling that the variance is defined as
we can rewrite the covariance matrix as
The entries of the covariance matrix can become very large in magnitude and may thus lead to problems with loss of numerical precision. It is therefore common to scale the covariance matrix by introducing instead the correlation matrix defined via the so-called correlation function
The correlation function is then given by values
In the above example this is the function we constructed using pandas.
10.49. Correlation Function and Design/Feature Matrix¶
In our derivation of the various regression algorithms like Ordinary Least Squares or Ridge regression
we defined the design/feature matrix
with
with a given vector
With these definitions, we can now rewrite our
and the correlation matrix
10.50. Covariance Matrix Examples¶
The Numpy function np.cov calculates the covariance elements using
the factor $1/(n-1)$. In the example below two vectors are stacked into a matrix $W$ with np.vstack, which in turn is converted into the $2\times 2$ covariance matrix $C$.
# Importing various packages
import numpy as np
n = 100
x = np.random.normal(size=n)
print(np.mean(x))
y = 4+3*x+np.random.normal(size=n)
print(np.mean(y))
W = np.vstack((x, y))
C = np.cov(W)
print(C)
10.51. Correlation Matrix¶
The previous example can be converted into the correlation matrix by
simply scaling the matrix elements with the variances. We should also
subtract the mean values for each column. This leads to the following
code which sets up the correlation matrix for the previous example in a more brute-force way. Here we subtract the mean values for each column of the design matrix, calculate the relevant variances and covariances, and then finally set up the $2\times 2$ correlation matrix.
import numpy as np
n = 100
# define two vectors
x = np.random.random(size=n)
y = 4+3*x+np.random.normal(size=n)
#scaling the x and y vectors
x = x - np.mean(x)
y = y - np.mean(y)
variance_x = np.sum(x@x)/n
variance_y = np.sum(y@y)/n
print(variance_x)
print(variance_y)
cov_xy = np.sum(x@y)/n
cov_xx = np.sum(x@x)/n
cov_yy = np.sum(y@y)/n
C = np.zeros((2,2))
C[0,0]= cov_xx/variance_x
C[1,1]= cov_yy/variance_y
C[0,1]= cov_xy/np.sqrt(variance_y*variance_x)
C[1,0]= C[0,1]
print(C)
We see that the matrix elements along the diagonal are one as they should be and that the matrix is symmetric. Furthermore, diagonalizing this matrix we easily see that it is a positive definite matrix.
The above procedure with numpy can be made more compact if we use pandas.
10.52. Correlation Matrix with Pandas¶
We show here how we can set up the correlation matrix using pandas, as done in this simple code
import numpy as np
import pandas as pd
n = 10
x = np.random.normal(size=n)
x = x - np.mean(x)
y = 4+3*x+np.random.normal(size=n)
y = y - np.mean(y)
X = (np.vstack((x, y))).T
print(X)
Xpd = pd.DataFrame(X)
print(Xpd)
correlation_matrix = Xpd.corr()
print(correlation_matrix)
We expand this model to the Franke function discussed above.
10.53. Correlation Matrix with Pandas and the Franke function¶
# Common imports
import numpy as np
import pandas as pd
def FrankeFunction(x,y):
term1 = 0.75*np.exp(-(0.25*(9*x-2)**2) - 0.25*((9*y-2)**2))
term2 = 0.75*np.exp(-((9*x+1)**2)/49.0 - 0.1*(9*y+1))
term3 = 0.5*np.exp(-(9*x-7)**2/4.0 - 0.25*((9*y-3)**2))
term4 = -0.2*np.exp(-(9*x-4)**2 - (9*y-7)**2)
return term1 + term2 + term3 + term4
def create_X(x, y, n ):
if len(x.shape) > 1:
x = np.ravel(x)
y = np.ravel(y)
N = len(x)
l = int((n+1)*(n+2)/2) # Number of elements in beta
X = np.ones((N,l))
for i in range(1,n+1):
q = int((i)*(i+1)/2)
for k in range(i+1):
X[:,q+k] = (x**(i-k))*(y**k)
return X
# Making meshgrid of datapoints and compute Franke's function
n = 4
N = 100
x = np.sort(np.random.uniform(0, 1, N))
y = np.sort(np.random.uniform(0, 1, N))
z = FrankeFunction(x, y)
X = create_X(x, y, n=n)
Xpd = pd.DataFrame(X)
# subtract the mean values and set up the covariance matrix
Xpd = Xpd - Xpd.mean()
covariance_matrix = Xpd.cov()
print(covariance_matrix)
We note here that the covariance is zero for the first rows and
columns since all matrix elements in the design matrix were set to one
(we are fitting the function in terms of a polynomial of degree
This means that the variance for these elements will be zero and will cause problems when we set up the correlation matrix. We can simply drop these elements and construct a correlation matrix without these elements.
10.54. Rewriting the Covariance and/or Correlation Matrix¶
We can rewrite the covariance matrix in a more compact form in terms of the design/feature matrix
To see this let us simply look at a design matrix
If we then compute the expectation value
which is just
where we wrote
It is easy to generalize this to a matrix
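A small numerical check (a sketch with synthetic data): after subtracting the column means, $X^TX/n$ reproduces the covariance matrix, up to the $1/(n-1)$ convention used by np.cov:
import numpy as np
np.random.seed(2)
n = 1000
x = np.random.normal(size=n)
y = 4 + 3*x + np.random.normal(size=n)
X = np.column_stack((x, y))
Xc = X - X.mean(axis=0)            # center each column
C_manual = Xc.T @ Xc / n           # covariance with the 1/n factor
C_numpy = np.cov(X.T)*(n - 1)/n    # np.cov uses 1/(n-1); rescale for comparison
print(np.allclose(C_manual, C_numpy))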
10.55. Linking with SVD¶
See lecture september 11. More text to be added here soon.
10.56. Where are we going?¶
Before we proceed, we need to rethink what we have been doing. In our eagerness to fit the data, we have omitted several important elements in our regression analysis. In what follows we will
look at statistical properties, including a discussion of mean values, variance and the so-called bias-variance tradeoff
introduce resampling techniques like cross-validation, bootstrapping and jackknife and more
This will allow us to link the standard linear algebra methods we have discussed above to a statistical interpretation of the methods.
10.57. Resampling methods¶
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
Two resampling methods are often used in Machine Learning analyses,
The bootstrap method
and Cross-Validation
In addition there are several other methods such as the Jackknife and the Blocking methods. We will discuss in particular cross-validation and the bootstrap method.
10.58. Resampling approaches can be computationally expensive¶
Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In this chapter, we discuss two of the most commonly used resampling methods, cross-validation and the bootstrap. Both methods are important tools in the practical application of many statistical learning procedures. For example, cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection. The bootstrap is widely used.
10.59. Why resampling methods ?¶
Statistical analysis.
Our simulations can be treated as computer experiments. This is particularly the case for Monte Carlo methods
The results can be analysed with the same statistical tools as we would use analysing experimental data.
As in all experiments, we are looking for expectation values and an estimate of how accurate they are, i.e., possible sources for errors.
10.60. Statistical analysis¶
As in other experiments, many numerical experiments have two classes of errors:
Statistical errors
Systematical errors
Statistical errors can be estimated using standard tools from statistics
Systematical errors are method specific and must be treated differently from case to case.
10.61. Linking the regression analysis with a statistical interpretation¶
The
advantage of doing linear regression is that we actually end up with
analytical expressions for several statistical quantities.
Standard least squares and Ridge regression allow us to
derive quantities like the variance and other expectation values in a
rather straightforward way.
It is assumed that
The randomness of
Recall that
10.62. Assumptions made¶
The assumption we have made here can be summarized as (and this is going to be useful when we discuss the bias-variance trade off)
that there exists a function
We approximate this function with our model from the solution of the linear regression equations, that is our
function
10.63. Expectation value and variance¶
We can calculate the expectation value of
while its variance is
Hence,
10.64. Expectation value and variance for the parameters $\beta$¶
With the OLS expressions for the parameters
This means that the estimator of the regression parameters is unbiased.
We can also calculate the variance
The variance of
where we have used that
In a similar way, we can obtain analytical expressions for say the
expectation values of the parameters
It is rather straightforward to show that
We see clearly that
We can also compute the variance as
and it is easy to see that if the parameter
With this, we can compute the difference
The difference is non-negative definite since each component of the
matrix product is non-negative definite.
This means that the variance we obtain with the standard OLS will always (for $\lambda > 0$) be larger than or equal to the variance we obtain with Ridge regression.
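These analytical variance expressions can be checked numerically. The sketch below (synthetic data, made-up true parameters and noise level) repeats the "experiment" many times and compares the empirical covariance of the OLS estimates with $\sigma^2 (X^TX)^{-1}$:
import numpy as np
np.random.seed(3)
n, p, sigma = 100, 3, 0.5
X = np.random.randn(n, p)
beta_true = np.array([1.0, -1.0, 2.0])
trials = 5000
betas = np.zeros((trials, p))
XtXinv = np.linalg.inv(X.T @ X)
for i in range(trials):
    y = X @ beta_true + sigma*np.random.randn(n)   # new noise realization each time
    betas[i] = XtXinv @ X.T @ y                    # OLS estimate for this realization
print(np.cov(betas.T))           # empirical covariance of the estimates
print(sigma**2 * XtXinv)         # analytical result sigma^2 (X^T X)^{-1}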
10.65. Resampling methods¶
With all these analytical equations for both the OLS and Ridge regression, we will now outline how to assess a given model. This will lead us to a discussion of the so-called bias-variance tradeoff (see below) and so-called resampling methods.
One of the quantities we have discussed as a way to measure errors is the mean-squared error (MSE), mainly used for fitting of continuous functions. Another choice is the absolute error.
In the discussions below we will focus on the MSE and in particular since we will split the data into test and training data, we discuss the
prediction error or simply the test error, where we have a fixed training set and the test error is the MSE arising from the data reserved for testing. We also discuss the training error, which is the average loss over the training data.
As our model becomes more and more complex, more of the training data tends to be used. The model may then adapt to more complicated structures in the data. This may lead to a decrease in the bias (see below for a code example) and a slight increase of the variance for the test error. For a certain level of complexity the test error will reach a minimum, before starting to increase again. The training error reaches a saturation.
10.66. Resampling methods: Jackknife and Bootstrap¶
Two famous resampling methods are the independent bootstrap and the jackknife.
The jackknife is a special case of the independent bootstrap. Still, the jackknife was made popular prior to the independent bootstrap. And as the popularity of the independent bootstrap soared, new variants, such as the dependent bootstrap, were developed.
The Jackknife and independent bootstrap work for
independent, identically distributed random variables.
If these conditions are not
satisfied, the methods will fail. Yet, it should be said that if the data are
independent, identically distributed, and we only want to estimate the
variance of
10.67. Resampling methods: Jackknife¶
The Jackknife works by making many replicas of the estimator
which equals the vector
10.68. Jackknife code example¶
from numpy import *
from numpy.random import randint, randn
from time import time
def jackknife(data, stat):
n = len(data);t = zeros(n); inds = arange(n); t0 = time()
## 'jackknifing' by leaving out an observation for each i
for i in range(n):
t[i] = stat(delete(data,i) )
# analysis
print("Runtime: %g sec" % (time()-t0)); print("Jackknife Statistics :")
print("original bias std. error")
print("%8g %14g %15g" % (stat(data),(n-1)*mean(t)/n, (n*var(t))**.5))
return t
# Returns mean of data samples
def stat(data):
return mean(data)
mu, sigma = 100, 15
datapoints = 10000
x = mu + sigma*random.randn(datapoints)
# jackknife returns the data sample
t = jackknife(x, stat)
10.69. Resampling methods: Bootstrap¶
Bootstrapping is a nonparametric approach to statistical inference that substitutes computation for more traditional distributional assumptions and asymptotic results. Bootstrapping offers a number of advantages:
The bootstrap is quite general, although there are some cases in which it fails.
Because it does not require distributional assumptions (such as normally distributed errors), the bootstrap can provide more accurate inferences when the data are not well behaved or when the sample size is small.
It is possible to apply the bootstrap to statistics with sampling distributions that are difficult to derive, even asymptotically.
It is relatively simple to apply the bootstrap to complex data-collection plans (such as stratified and clustered samples).
10.70. Resampling methods: Bootstrap background¶
Since
10.71. Resampling methods: More Bootstrap background¶
In the case that
Drawing lots of numbers from the estimated distribution, suppose we call one such set of numbers $\boldsymbol{x}^*$. Then using these numbers, we could compute a replica of $\widehat{\theta}$, called $\widehat{\theta}^*$.
By repeated use of (1) and (2), many
estimates of
10.72. Resampling methods: Bootstrap approach¶
But
unless there is enough information available about the process that
generated
Instead of generating the histogram for the relative
frequency of the observation
10.73. Resampling methods: Bootstrap steps¶
The independent bootstrap works like this:
Draw with replacement $n$ numbers for the observed variables $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$.
Define a vector $\boldsymbol{x}^*$ containing the values which were drawn from $\boldsymbol{x}$.
Using the vector $\boldsymbol{x}^*$, compute $\widehat{\theta}^*$ by evaluating $\widehat{\theta}$ under the observations $\boldsymbol{x}^*$.
Repeat this process $k$ times.
When you are done, you can draw a histogram of the relative frequency of $\widehat{\theta}^*$. This gives an estimate of the distribution of the estimator.
10.74. Code example for the Bootstrap method¶
The following code starts with a Gaussian distribution with mean value $\mu = 100$ and standard deviation $\sigma = 15$. The bootstrap samples are then drawn from these data.
from numpy import *
from numpy.random import randint, randn
from time import time
from scipy.stats import norm
import matplotlib.pyplot as plt
# Returns mean of bootstrap samples
def stat(data):
return mean(data)
# Bootstrap algorithm
def bootstrap(data, statistic, R):
t = zeros(R); n = len(data); inds = arange(n); t0 = time()
# non-parametric bootstrap
for i in range(R):
t[i] = statistic(data[randint(0,n,n)])
# analysis
print("Runtime: %g sec" % (time()-t0)); print("Bootstrap Statistics :")
print("original bias std. error")
print("%8g %8g %14g %15g" % (statistic(data), std(data),mean(t),std(t)))
return t
mu, sigma = 100, 15
datapoints = 10000
x = mu + sigma*random.randn(datapoints)
# bootstrap returns the data sample
t = bootstrap(x, stat, datapoints)
# the histogram of the bootstrapped data
n, binsboot, patches = plt.hist(t, 50, density=True, facecolor='red', alpha=0.75)
# add a 'best fit' line using the normal pdf with the bootstrap mean and standard deviation
y = norm.pdf(binsboot, mean(t), std(t))
lt = plt.plot(binsboot, y, 'r--', linewidth=1)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.axis([99.5, 100.6, 0, 3.0])
plt.grid(True)
plt.show()
10.75. Various steps in cross-validation¶
When the repetitive splitting of the data set is done randomly,
samples may accidentally end up in a vast majority of the splits in
either the training or the test set. Such samples may have an unbalanced
influence on either model building or prediction evaluation. To avoid
this
10.76. How to set up the cross-validation for Ridge and/or Lasso¶
Define a range of interest for the penalty parameter.
Divide the data set into a training set and a test set.
Fit the linear regression model by means of Ridge estimation for each penalty parameter in the grid using the training set, and compute the corresponding estimate of the error variance.
Evaluate the prediction performance of these models on the test set, for example by the mean squared prediction error, the relative error, or the R2 score function.
Repeat the first three steps such that each sample plays the role of the test set once.
Average the prediction performances of the test sets at each grid point of the penalty parameter. This is an estimate of the prediction performance of the model corresponding to this value of the penalty parameter on novel data.
10.77. Cross-validation in brief¶
For the various values of $k$:
shuffle the dataset randomly.
Split the dataset into $k$ groups.
For each unique group:
a. Decide which group to use as set for test data
b. Take the remaining groups as a training data set
c. Fit a model on the training set and evaluate it on the test set
d. Retain the evaluation score and discard the model
Summarize the model using the sample of model evaluation scores
10.78. Code Example for Cross-validation and $k$-fold Cross-validation¶
The code here uses Ridge regression with cross-validation (CV) resampling and $k$-fold cross-validation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(3155)
# Generate the data.
nsamples = 100
x = np.random.randn(nsamples)
y = 3*x**2 + np.random.randn(nsamples)
## Cross-validation on Ridge regression using KFold only
# Decide degree on polynomial to fit
poly = PolynomialFeatures(degree = 6)
# Decide which values of lambda to use
nlambdas = 500
lambdas = np.logspace(-3, 5, nlambdas)
# Initialize a KFold instance
k = 5
kfold = KFold(n_splits = k)
# Perform the cross-validation to estimate MSE
scores_KFold = np.zeros((nlambdas, k))
i = 0
for lmb in lambdas:
ridge = Ridge(alpha = lmb)
j = 0
for train_inds, test_inds in kfold.split(x):
xtrain = x[train_inds]
ytrain = y[train_inds]
xtest = x[test_inds]
ytest = y[test_inds]
Xtrain = poly.fit_transform(xtrain[:, np.newaxis])
ridge.fit(Xtrain, ytrain[:, np.newaxis])
Xtest = poly.fit_transform(xtest[:, np.newaxis])
ypred = ridge.predict(Xtest)
scores_KFold[i,j] = np.sum((ypred - ytest[:, np.newaxis])**2)/np.size(ypred)
j += 1
i += 1
estimated_mse_KFold = np.mean(scores_KFold, axis = 1)
## Cross-validation using cross_val_score from sklearn along with KFold
# kfold is an instance initialized above as:
# kfold = KFold(n_splits = k)
estimated_mse_sklearn = np.zeros(nlambdas)
i = 0
for lmb in lambdas:
ridge = Ridge(alpha = lmb)
X = poly.fit_transform(x[:, np.newaxis])
estimated_mse_folds = cross_val_score(ridge, X, y[:, np.newaxis], scoring='neg_mean_squared_error', cv=kfold)
# cross_val_score returns an array containing the estimated negative mse for every fold.
# we have to take the mean of every array in order to get an estimate of the mse of the model
estimated_mse_sklearn[i] = np.mean(-estimated_mse_folds)
i += 1
## Plot and compare the slightly different ways to perform cross-validation
plt.figure()
plt.plot(np.log10(lambdas), estimated_mse_sklearn, label = 'cross_val_score')
plt.plot(np.log10(lambdas), estimated_mse_KFold, 'r--', label = 'KFold')
plt.xlabel('log10(lambda)')
plt.ylabel('mse')
plt.legend()
plt.show()
10.79. The bias-variance tradeoff¶
We will discuss the bias-variance tradeoff in the context of
continuous predictions such as regression. However, many of the
intuitions and ideas discussed here also carry over to classification
tasks. Consider a dataset
Let us assume that the true data is generated from a noisy model
where
In our derivation of the ordinary least squares method we defined then
an approximation to the function
Thereafter we found the parameters
We can rewrite this as
The three terms represent the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method; the variance of the chosen model; and finally the variance of the error $\epsilon$.
To derive this equation, we need to recall that the variance of
and adding and subtracting
which, using the abovementioned expectation values can be rewritten as
that is the rewriting in terms of the so-called bias, the variance of the model
10.80. Example code for Bias-Variance tradeoff¶
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample
np.random.seed(2018)
n = 500
n_boostraps = 100
degree = 18 # A quite high value, just to show.
noise = 0.1
# Make data set.
x = np.linspace(-1, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2) + np.random.normal(0, 0.1, x.shape)
# Hold out some test data that is never used in training.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Combine x transformation and model into one operation.
# Not necessary, but convenient.
model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression(fit_intercept=False))
# The following (m x n_bootstraps) matrix holds the column vectors y_pred
# for each bootstrap iteration.
y_pred = np.empty((y_test.shape[0], n_boostraps))
for i in range(n_boostraps):
x_, y_ = resample(x_train, y_train)
# Evaluate the new model on the same test data each time.
y_pred[:, i] = model.fit(x_, y_).predict(x_test).ravel()
# Note: Expectations and variances taken w.r.t. different training
# data sets, hence the axis=1. Subsequent means are taken across the test data
# set in order to obtain a total value, but before this we have error/bias/variance
# calculated per data point in the test set.
# Note 2: The use of keepdims=True is important in the calculation of bias as this
# maintains the column vector form. Dropping this yields very unexpected results.
error = np.mean( np.mean((y_test - y_pred)**2, axis=1, keepdims=True) )
bias = np.mean( (y_test - np.mean(y_pred, axis=1, keepdims=True))**2 )
variance = np.mean( np.var(y_pred, axis=1, keepdims=True) )
print('Error:', error)
print('Bias^2:', bias)
print('Var:', variance)
print('{} >= {} + {} = {}'.format(error, bias, variance, bias+variance))
plt.plot(x[::5, :], y[::5, :], label='f(x)')
plt.scatter(x_test, y_test, label='Data points')
plt.scatter(x_test, np.mean(y_pred, axis=1), label='Pred')
plt.legend()
plt.show()
10.81. Understanding what happens¶
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample
np.random.seed(2018)
n = 40
n_boostraps = 100
maxdegree = 14
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
error = np.zeros(maxdegree)
bias = np.zeros(maxdegree)
variance = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
for degree in range(maxdegree):
model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression(fit_intercept=False))
y_pred = np.empty((y_test.shape[0], n_boostraps))
for i in range(n_boostraps):
x_, y_ = resample(x_train, y_train)
y_pred[:, i] = model.fit(x_, y_).predict(x_test).ravel()
polydegree[degree] = degree
error[degree] = np.mean( np.mean((y_test - y_pred)**2, axis=1, keepdims=True) )
bias[degree] = np.mean( (y_test - np.mean(y_pred, axis=1, keepdims=True))**2 )
variance[degree] = np.mean( np.var(y_pred, axis=1, keepdims=True) )
print('Polynomial degree:', degree)
print('Error:', error[degree])
print('Bias^2:', bias[degree])
print('Var:', variance[degree])
print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))
plt.plot(polydegree, error, label='Error')
plt.plot(polydegree, bias, label='bias')
plt.plot(polydegree, variance, label='Variance')
plt.legend()
plt.show()
10.82. Summing up¶
The bias-variance tradeoff summarizes the fundamental tension in machine learning, particularly supervised learning, between the complexity of a model and the amount of training data needed to train it. Since data is often limited, in practice it is often useful to use a less-complex model with higher bias, that is a model whose asymptotic performance is worse than another model because it is easier to train and less sensitive to sampling noise arising from having a finite-sized training dataset (smaller variance).
The above equations tell us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below the variance of the noise $\sigma^2$, the so-called irreducible error.
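For reference, the decomposition referred to above can be written compactly as (a recap in the notation used in the derivation, with $\boldsymbol{\tilde{y}}$ denoting the model prediction and $\sigma^2$ the variance of the noise)

$$
\mathbb{E}\left[\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)^2\right]
= \mathrm{Bias}\left[\tilde{y}\right]^2 + \mathrm{Var}\left[\tilde{y}\right] + \sigma^2,
$$

where the first two terms depend on the model we choose, while the last term is the irreducible error.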
What do we mean by the variance and bias of a statistical learning method? The variance refers to the amount by which our model would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different estimate. But ideally the estimate for our model should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in the model. In general, more flexible statistical methods have higher variance.
You may also find this recent article of interest.
10.83. Another Example from Scikit-Learn’s Repository¶
"""
============================
Underfitting vs. Overfitting
============================
This example demonstrates the problems of underfitting and overfitting and
how we can use linear regression with polynomial features to approximate
nonlinear functions. The plot shows the function that we want to approximate,
which is a part of the cosine function. In addition, the samples from the
real function and the approximations of different models are displayed. The
models have polynomial features of different degrees. We can see that a
linear function (polynomial with degree 1) is not sufficient to fit the
training samples. This is called **underfitting**. A polynomial of degree 4
approximates the true function almost perfectly. However, for higher degrees
the model will **overfit** the training data, i.e. it learns the noise of the
training data.
We evaluate quantitatively **overfitting** / **underfitting** by using
cross-validation. We calculate the mean squared error (MSE) on the validation
set; the higher it is, the less likely the model generalizes correctly from the
training data.
"""
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using crossvalidation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()
10.84. More examples on bootstrap and cross-validation and errors¶
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import mean_squared_error
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("EoS.csv"),'r')
# Read the EoS data as csv file and organize the data into two arrays with density and energies
EoS = pd.read_csv(infile, names=('Density', 'Energy'))
EoS['Energy'] = pd.to_numeric(EoS['Energy'], errors='coerce')
EoS = EoS.dropna()
Energies = EoS['Energy']
Density = EoS['Density']
# The design matrix is now a function of various polytropic powers of the density
Maxpolydegree = 30
X = np.zeros((len(Density),Maxpolydegree))
X[:,0] = 1.0
testerror = np.zeros(Maxpolydegree)
trainingerror = np.zeros(Maxpolydegree)
polynomial = np.zeros(Maxpolydegree)
trials = 100
for polydegree in range(1, Maxpolydegree):
    polynomial[polydegree] = polydegree
    for degree in range(polydegree):
        X[:,degree] = Density**(degree/3.0)
    # loop over trials in order to estimate the expectation value of the MSE
    testerror[polydegree] = 0.0
    trainingerror[polydegree] = 0.0
    for samples in range(trials):
        x_train, x_test, y_train, y_test = train_test_split(X, Energies, test_size=0.2)
        model = LinearRegression(fit_intercept=True).fit(x_train, y_train)
        ypred = model.predict(x_train)
        ytilde = model.predict(x_test)
        testerror[polydegree] += mean_squared_error(y_test, ytilde)
        trainingerror[polydegree] += mean_squared_error(y_train, ypred)
    testerror[polydegree] /= trials
    trainingerror[polydegree] /= trials
    print("Degree of polynomial: %3d"% polynomial[polydegree])
    print("Mean squared error on training data: %.8f" % trainingerror[polydegree])
    print("Mean squared error on test data: %.8f" % testerror[polydegree])
plt.plot(polynomial, np.log10(trainingerror), label='Training Error')
plt.plot(polynomial, np.log10(testerror), label='Test Error')
plt.xlabel('Polynomial degree')
plt.ylabel('log10[MSE]')
plt.legend()
plt.show()
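The script above estimates the expected MSE by repeating the train-test split. Since the section title also mentions the bootstrap, here is a minimal sketch of how the inner loop over trials could instead bootstrap a fixed training set with resample (already imported above); the snippet reuses the variables defined in the script and is an illustration, not part of the original analysis.

# Sketch: bootstrap variant of the inner loop above.
# Assumes X, Energies, trials, testerror, trainingerror and polydegree
# are defined as in the script.
x_train, x_test, y_train, y_test = train_test_split(X, Energies, test_size=0.2)
for samples in range(trials):
    x_, y_ = resample(x_train, y_train)   # bootstrap sample of the training data
    model = LinearRegression(fit_intercept=True).fit(x_, y_)
    testerror[polydegree] += mean_squared_error(y_test, model.predict(x_test))
    trainingerror[polydegree] += mean_squared_error(y_, model.predict(x_))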
10.85. The same example but now with cross-validation¶
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
    os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
    os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
    os.makedirs(DATA_ID)

def image_path(fig_id):
    return os.path.join(FIGURE_ID, fig_id)

def data_path(dat_id):
    return os.path.join(DATA_ID, dat_id)

def save_fig(fig_id):
    plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("EoS.csv"),'r')
# Read the EoS data as csv file and organize the data into two arrays with density and energies
EoS = pd.read_csv(infile, names=('Density', 'Energy'))
EoS['Energy'] = pd.to_numeric(EoS['Energy'], errors='coerce')
EoS = EoS.dropna()
Energies = EoS['Energy']
Density = EoS['Density']
# The design matrix is now a function of various polytropic powers of the density
Maxpolydegree = 30
X = np.zeros((len(Density),Maxpolydegree))
X[:,0] = 1.0
estimated_mse_sklearn = np.zeros(Maxpolydegree)
polynomial = np.zeros(Maxpolydegree)
k = 5
kfold = KFold(n_splits = k)
for polydegree in range(1, Maxpolydegree):
    polynomial[polydegree] = polydegree
    for degree in range(polydegree):
        X[:,degree] = Density**(degree/3.0)
    OLS = LinearRegression()
    # cross_val_score performs the loop over the k folds and returns the negative MSE per fold
    estimated_mse_folds = cross_val_score(OLS, X, Energies, scoring='neg_mean_squared_error', cv=kfold)
    estimated_mse_sklearn[polydegree] = np.mean(-estimated_mse_folds)
plt.plot(polynomial, np.log10(estimated_mse_sklearn), label='Test Error')
plt.xlabel('Polynomial degree')
plt.ylabel('log10[MSE]')
plt.legend()
plt.show()
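To identify which polynomial degree gives the smallest cross-validated error in the script above, a short sketch (note that index 0 of estimated_mse_sklearn is never filled, since the loop starts at degree 1):

# Sketch: polynomial degree with the smallest cross-validated MSE.
# Skip index 0, which the loop above never fills.
best_degree = np.argmin(estimated_mse_sklearn[1:]) + 1
print(f"Best polynomial degree: {best_degree} with CV MSE {estimated_mse_sklearn[best_degree]:.4g}")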
10.86. Cross-validation with Ridge¶
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
# A seed just to ensure that the random numbers are the same for every run.
np.random.seed(3155)
# Generate the data.
n = 100
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
# Decide degree of polynomial to fit and build the corresponding design matrix
poly = PolynomialFeatures(degree = 10)
X = poly.fit_transform(x)
# Decide which values of lambda to use
nlambdas = 500
lambdas = np.logspace(-3, 5, nlambdas)
# Initialize a KFold instance
k = 5
kfold = KFold(n_splits = k)
estimated_mse_sklearn = np.zeros(nlambdas)
i = 0
for lmb in lambdas:
    ridge = Ridge(alpha = lmb)
    estimated_mse_folds = cross_val_score(ridge, X, y, scoring='neg_mean_squared_error', cv=kfold)
    estimated_mse_sklearn[i] = np.mean(-estimated_mse_folds)
    i += 1
plt.figure()
plt.plot(np.log10(lambdas), estimated_mse_sklearn, label = 'cross_val_score')
plt.xlabel('log10(lambda)')
plt.ylabel('MSE')
plt.legend()
plt.show()
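To read off which value of $\lambda$ minimizes the cross-validation estimate of the MSE, a one-line addition to the script above suffices (a sketch using the arrays lambdas and estimated_mse_sklearn defined there):

# Sketch: locate the lambda value with the smallest cross-validated MSE.
best = np.argmin(estimated_mse_sklearn)
print(f"Optimal lambda: {lambdas[best]:.4g} with CV MSE {estimated_mse_sklearn[best]:.4g}")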
10.87. The Ising model¶
The one-dimensional Ising model with nearest neighbor interaction, no external field and a constant coupling constant $J$ is given by

$$
E = -J\sum_{k=1}^{L}s_ks_{k+1},
$$

where $s_k=\pm 1$ and we assume periodic boundary conditions, $s_{L+1}=s_1$.

We will look at a system of $L=40$ spins, generating $n=10^4$ random spin configurations and computing the energy of each of them with the code below.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns
import scipy.linalg as scl
from sklearn.model_selection import train_test_split
import tqdm
sns.set(color_codes=True)
cmap_args=dict(vmin=-1., vmax=1., cmap='seismic')
L = 40
n = int(1e4)
spins = np.random.choice([-1, 1], size=(n, L))
J = 1.0
energies = np.zeros(n)
for i in range(n):
    energies[i] = - J * np.dot(spins[i], np.roll(spins[i], 1))
Here we use ordinary least squares regression to predict the energy for the nearest neighbor one-dimensional Ising model on a ring, i.e., the endpoints wrap around. We will use linear regression to fit a value for the coupling constant to achieve this.
10.88. Reformulating the problem to suit regression¶
A more general form for the one-dimensional Ising model is

$$
E = -\sum_{j=1}^{L}\sum_{k=1}^{L}s_jJ_{j,k}s_k.
$$

Here we allow for interactions beyond the nearest neighbors and a state dependent coupling constant. This latter expression can be formulated as a matrix-product

$$
\boldsymbol{E} = \boldsymbol{X}\boldsymbol{J},
$$

where $X_{j,k}=s_js_k$ plays the role of the design matrix and the couplings $J_{j,k}$, collected in $\boldsymbol{J}$, are the parameters we want to learn, in complete analogy with the parameters $\boldsymbol{\beta}$ of linear regression.
We split the data in training and test data as discussed in the previous example
X = np.zeros((n, L ** 2))
for i in range(n):
    X[i] = np.outer(spins[i], spins[i]).ravel()
y = energies
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
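As a quick sanity check of the matrix formulation (a sketch, not part of the original analysis), we can build the true nearest-neighbor coupling matrix, with $-J$ on the superdiagonal and in the corner element encoding the periodic boundary, and verify that the matrix-product form reproduces the energies computed above:

# Sketch: verify that X @ J_true reproduces the energies computed above.
J_true = np.zeros((L, L))
for j in range(L):
    J_true[j, (j + 1) % L] = -J   # nearest-neighbor coupling on a ring
print(np.allclose(X @ J_true.ravel(), energies))  # should print True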
10.89. Linear regression¶
In the ordinary least squares method we choose the cost function

$$
C(\boldsymbol{X},\boldsymbol{\beta}) = \left(\boldsymbol{X}\boldsymbol{\beta}-\boldsymbol{y}\right)^T\left(\boldsymbol{X}\boldsymbol{\beta}-\boldsymbol{y}\right).
$$

We then find the extremal point of $C$ by taking the derivative with respect to $\boldsymbol{\beta}$ and setting it to zero, which gives the familiar solution

$$
\boldsymbol{\beta} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y},
$$

which immediately imposes some requirements on the matrix $\boldsymbol{X}^T\boldsymbol{X}$: it must be invertible.
X_train_own = np.concatenate(
    (np.ones(len(X_train))[:, np.newaxis], X_train),
    axis=1
)
X_test_own = np.concatenate(
    (np.ones(len(X_test))[:, np.newaxis], X_test),
    axis=1
)

def ols_inv(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return scl.inv(x.T @ x) @ (x.T @ y)

beta = ols_inv(X_train_own, y_train)
10.90. Singular Value decomposition¶
Doing the inversion directly turns out to be a bad idea since the matrix $\boldsymbol{X}^T\boldsymbol{X}$ can be singular, or at best very ill-conditioned. An alternative is to write the solution in terms of the pseudoinverse,

$$
\boldsymbol{\beta} = \boldsymbol{X}^{+}\boldsymbol{y},
$$

where the pseudoinverse of $\boldsymbol{X}$ is obtained from its singular value decomposition.

Using singular value decomposition we can decompose the matrix as $\boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^T$, with $\boldsymbol{U}$ and $\boldsymbol{V}$ orthogonal matrices and $\boldsymbol{\Sigma}$ a diagonal matrix holding the singular values, so that

$$
\boldsymbol{\beta} = \boldsymbol{V}\boldsymbol{\Sigma}^{+}\boldsymbol{U}^T\boldsymbol{y}.
$$

Note that solving this equation by actually forming the pseudoinverse (which is what we will do) is not the most efficient route, as the full singular value decomposition of an $m\times n$ matrix with $m\ge n$ scales roughly as $\mathcal{O}(mn^2)$; for very large problems dedicated least-squares solvers are preferable.
def ols_svd(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    u, s, v = scl.svd(x)
    return v.T @ scl.pinv(scl.diagsvd(s, u.shape[0], v.shape[0])) @ u.T @ y

beta = ols_svd(X_train_own, y_train)
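An equivalent, and somewhat shorter, way to obtain the same least-squares solution is to let the library compute the pseudoinverse of the design matrix directly (a sketch; scl.pinv performs the singular value decomposition internally):

# Sketch: the same minimum-norm solution via the pseudoinverse of the design matrix itself.
beta_pinv = scl.pinv(X_train_own) @ y_train
print(np.allclose(beta, beta_pinv, atol=1e-6))  # should agree up to numerical precision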
When extracting the coupling matrix $\boldsymbol{J}$ we need to leave out the first element of $\boldsymbol{\beta}$, which corresponds to the column of ones (the intercept) we added to the design matrix, and reshape the remaining $L^2$ coefficients into an $L\times L$ matrix.
J = beta[1:].reshape(L, L)
A way of looking at the coefficients in $\boldsymbol{J}$ is to plot them as an $L\times L$ heat map, where each entry shows the strength of the fitted coupling between a pair of spins.
fig = plt.figure(figsize=(20, 14))
im = plt.imshow(J, **cmap_args)
plt.title("OLS", fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
cb = fig.colorbar(im)
cb.ax.set_yticklabels(cb.ax.get_yticklabels(), fontsize=18)
plt.show()
It is interesting to note that OLS considers both $J_{j,j+1}$ and $J_{j+1,j}$ as valid matrix elements and distributes the nearest-neighbor interaction symmetrically between them, so the fitted values end up close to $-0.5$ rather than a single element of $-1$.
In this case our matrix inversion was actually possible. The obvious question now is what is the mathematics behind the SVD?
10.91. The one-dimensional Ising model¶
Let us bring back the Ising model again, but now with an additional
focus on Ridge and Lasso regression as well. We repeat some of the
basic parts of the Ising model and the setup of the training and test
data. The one-dimensional Ising model with nearest neighbor
interaction, no external field and a constant coupling constant $J$ is, as above, given by

$$
E = -J\sum_{k=1}^{L}s_ks_{k+1},
$$

where $s_k=\pm 1$ and periodic boundary conditions, $s_{L+1}=s_1$, are assumed.

We will again look at a system of $L=40$ spins and $n=10^4$ random spin configurations.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns
import scipy.linalg as scl
from sklearn.model_selection import train_test_split
import sklearn.linear_model as skl
import tqdm
sns.set(color_codes=True)
cmap_args=dict(vmin=-1., vmax=1., cmap='seismic')
L = 40
n = int(1e4)
spins = np.random.choice([-1, 1], size=(n, L))
J = 1.0
energies = np.zeros(n)
for i in range(n):
    energies[i] = - J * np.dot(spins[i], np.roll(spins[i], 1))
A more general form for the one-dimensional Ising model is

$$
E = -\sum_{j=1}^{L}\sum_{k=1}^{L}s_jJ_{j,k}s_k.
$$

Here we allow for interactions beyond the nearest neighbors and a more adaptive coupling matrix. This latter expression can be formulated as a matrix-product on the form

$$
\boldsymbol{E} = \boldsymbol{X}\boldsymbol{J},
$$

where $X_{j,k}=s_js_k$ and $\boldsymbol{J}$ collects the couplings we want to determine.
We organize the data as we did above
X = np.zeros((n, L ** 2))
for i in range(n):
    X[i] = np.outer(spins[i], spins[i]).ravel()
y = energies
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.96)
X_train_own = np.concatenate(
    (np.ones(len(X_train))[:, np.newaxis], X_train),
    axis=1
)
X_test_own = np.concatenate(
    (np.ones(len(X_test))[:, np.newaxis], X_test),
    axis=1
)
We will do all the fitting with Scikit-Learn.
clf = skl.LinearRegression().fit(X_train, y_train)
When extracting the coupling matrix we simply reshape the $L^2$ fitted coefficients into an $L\times L$ matrix; Scikit-Learn stores the intercept separately, so no column needs to be removed.
J_sk = clf.coef_.reshape(L, L)
And then we plot the results
fig = plt.figure(figsize=(20, 14))
im = plt.imshow(J_sk, **cmap_args)
plt.title("LinearRegression from Scikit-learn", fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
cb = fig.colorbar(im)
cb.ax.set_yticklabels(cb.ax.get_yticklabels(), fontsize=18)
plt.show()
The results agree perfectly with our previous discussion where we used our own code.
10.92. Ridge regression¶
Having explored the ordinary least squares method we move on to ridge regression. In ridge regression we include a regularizer. This involves a new cost function, which in turn leads to a new estimate for the weights,

$$
C(\boldsymbol{X},\boldsymbol{\beta};\lambda) = \left(\boldsymbol{X}\boldsymbol{\beta}-\boldsymbol{y}\right)^T\left(\boldsymbol{X}\boldsymbol{\beta}-\boldsymbol{y}\right) + \lambda\boldsymbol{\beta}^T\boldsymbol{\beta},
$$

with the closed-form solution

$$
\boldsymbol{\beta} = \left(\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I}\right)^{-1}\boldsymbol{X}^T\boldsymbol{y}.
$$
_lambda = 0.1
clf_ridge = skl.Ridge(alpha=_lambda).fit(X_train, y_train)
J_ridge_sk = clf_ridge.coef_.reshape(L, L)
fig = plt.figure(figsize=(20, 14))
im = plt.imshow(J_ridge_sk, **cmap_args)
plt.title("Ridge from Scikit-learn", fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
cb = fig.colorbar(im)
cb.ax.set_yticklabels(cb.ax.get_yticklabels(), fontsize=18)
plt.show()
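Since ridge regression has a closed-form solution, we can also verify the Scikit-Learn fit against the analytical expression above. The sketch below switches off the intercept so that both approaches solve exactly the same problem; it is an added check, not part of the original notebook.

# Sketch: compare the closed-form ridge solution with Scikit-Learn.
# beta = (X^T X + lambda I)^{-1} X^T y, with the intercept switched off
# so that the analytical expression and Scikit-Learn agree exactly.
I_mat = np.eye(X_train.shape[1])
beta_ridge = np.linalg.solve(X_train.T @ X_train + _lambda * I_mat, X_train.T @ y_train)
clf_check = skl.Ridge(alpha=_lambda, fit_intercept=False).fit(X_train, y_train)
print(np.allclose(beta_ridge, clf_check.coef_, atol=1e-6))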
10.93. LASSO regression¶
In the Least Absolute Shrinkage and Selection Operator (LASSO) method we get a third cost function,

$$
C(\boldsymbol{X},\boldsymbol{\beta};\lambda) = \left(\boldsymbol{X}\boldsymbol{\beta}-\boldsymbol{y}\right)^T\left(\boldsymbol{X}\boldsymbol{\beta}-\boldsymbol{y}\right) + \lambda\sum_{j}\vert\beta_j\vert.
$$

Finding the extremal point of this cost function is not as straightforward as for least squares and ridge, since the absolute value is not differentiable at zero. We will therefore rely solely on the function Lasso from Scikit-Learn.
clf_lasso = skl.Lasso(alpha=_lambda).fit(X_train, y_train)
J_lasso_sk = clf_lasso.coef_.reshape(L, L)
fig = plt.figure(figsize=(20, 14))
im = plt.imshow(J_lasso_sk, **cmap_args)
plt.title("Lasso from Scikit-learn", fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
cb = fig.colorbar(im)
cb.ax.set_yticklabels(cb.ax.get_yticklabels(), fontsize=18)
plt.show()
It is quite striking how LASSO breaks the symmetry of the coupling constant as opposed to ridge and OLS. We get a sparse solution where essentially only the nearest-neighbor couplings survive, with values close to $-1$, while all other matrix elements are set to zero.
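A simple way to quantify this sparsity is to count how many of the $L^2$ fitted couplings the LASSO solution sets to (essentially) zero, as in the following sketch:

# Sketch: count the couplings that LASSO has set (essentially) to zero.
n_zero = np.sum(np.abs(J_lasso_sk) < 1e-8)
print(f"{n_zero} of {J_lasso_sk.size} couplings are zero")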
10.94. Performance as function of the regularization parameter¶
We see how the different models perform for a range of values of the regularization parameter $\lambda$.
lambdas = np.logspace(-4, 5, 10)
train_errors = {
    "ols_sk": np.zeros(lambdas.size),
    "ridge_sk": np.zeros(lambdas.size),
    "lasso_sk": np.zeros(lambdas.size)
}
test_errors = {
    "ols_sk": np.zeros(lambdas.size),
    "ridge_sk": np.zeros(lambdas.size),
    "lasso_sk": np.zeros(lambdas.size)
}
plot_counter = 1
fig = plt.figure(figsize=(32, 54))
for i, _lambda in enumerate(tqdm.tqdm(lambdas)):
    for key, method in zip(
        ["ols_sk", "ridge_sk", "lasso_sk"],
        [skl.LinearRegression(), skl.Ridge(alpha=_lambda), skl.Lasso(alpha=_lambda)]
    ):
        method = method.fit(X_train, y_train)
        train_errors[key][i] = method.score(X_train, y_train)
        test_errors[key][i] = method.score(X_test, y_test)
        omega = method.coef_.reshape(L, L)
        plt.subplot(10, 5, plot_counter)
        plt.imshow(omega, **cmap_args)
        plt.title(r"%s, $\lambda = %.4f$" % (key, _lambda))
        plot_counter += 1
plt.show()
We see that LASSO reaches a good solution for low values of $\lambda$, where it recovers the sparse nearest-neighbor coupling, whereas for large values of $\lambda$ the penalty dominates and the ridge and LASSO couplings are driven towards zero (OLS, having no penalty, is unaffected by $\lambda$).
10.95. Finding the optimal value of $\lambda$¶
To determine which value of $\lambda$ generalizes best, we plot the $R^2$ score returned by the score method on both the training and the test data as a function of $\lambda$.
fig = plt.figure(figsize=(20, 14))
colors = {
    "ols_sk": "r",
    "ridge_sk": "y",
    "lasso_sk": "c"
}
for key in train_errors:
    plt.semilogx(
        lambdas,
        train_errors[key],
        colors[key],
        label="Train {0}".format(key),
        linewidth=4.0
    )

for key in test_errors:
    plt.semilogx(
        lambdas,
        test_errors[key],
        colors[key] + "--",
        label="Test {0}".format(key),
        linewidth=4.0
    )
plt.legend(loc="best", fontsize=18)
plt.xlabel(r"$\lambda$", fontsize=18)
plt.ylabel(r"$R^2$", fontsize=18)
plt.tick_params(labelsize=18)
plt.show()
From the above figure we can see that LASSO with a small but finite value of $\lambda$ gives the best performance on the test data, consistent with the sparse structure of the underlying coupling matrix, while the test score of the regularized methods deteriorates once $\lambda$ becomes large.
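To read off the best-performing value of $\lambda$ for each method from the arrays used in the plot above, a short sketch (remember that score returns $R^2$, so larger is better):

# Sketch: best lambda per method according to the test-set R^2 score.
for key in test_errors:
    best = np.argmax(test_errors[key])
    print(f"{key}: best lambda = {lambdas[best]:.4g}, test R^2 = {test_errors[key][best]:.3f}")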