Week 37: Statistical interpretations and Resampling Methods#
Morten Hjorth-Jensen, Department of Physics, University of Oslo, Norway
Date: September 9, 2024
Plans for week 37, lecture Monday#
Material for the lecture on Monday September 9.
Statistical interpretation of Ridge and Lasso regression, see also slides from last week
Resampling techniques, Bootstrap and cross validation and bias-variance tradeoff (this may partly be discussed during the exercise sessions as well.
Readings and Videos:
Raschka et al, pages 175-192
Hastie et al Chapter 7, here we recommend 7.1-7.5 and 7.10 (cross-validation) and 7.11 (bootstrap). See https://link.springer.com/book/10.1007/978-0-387-84858-7.
Plans for week 37, lab sessions#
Material for the lab sessions on Tuesday and Wednesday.
Calculations of expectation values
Discussion of resampling techniques
Exercise set for week 37
Work on project 1
For more discussions of Ridge regression and calculation of averages, Wessel van Wieringen’s article is highly recommended.
Material for lecture Monday September 9#
Deriving OLS from a probability distribution#
Our basic assumption when we derived the OLS equations was to assume that our output is determined by a given continuous function \(f(\boldsymbol{x})\) and a random noise \(\boldsymbol{\epsilon}\) given by the normal distribution with zero mean value and an undetermined variance \(\sigma^2\).
We found above that the outputs \(\boldsymbol{y}\) have a mean value given by \(\boldsymbol{X}\hat{\boldsymbol{\beta}}\) and variance \(\sigma^2\). Since the entries to the design matrix are not stochastic variables, we can assume that the probability distribution of our targets is also a normal distribution but now with mean value \(\boldsymbol{X}\hat{\boldsymbol{\beta}}\). This means that a single output \(y_i\) is given by the Gaussian distribution
Independent and Identically Distrubuted (iid)#
We assume now that the various \(y_i\) values are stochastically distributed according to the above Gaussian distribution. We define this distribution as
which reads as finding the likelihood of an event \(y_i\) with the input variables \(\boldsymbol{X}\) given the parameters (to be determined) \(\boldsymbol{\beta}\).
Since these events are assumed to be independent and identicall distributed we can build the probability distribution function (PDF) for all possible event \(\boldsymbol{y}\) as the product of the single events, that is we have
We will write this in a more compact form reserving \(\boldsymbol{D}\) for the domain of events, including the ouputs (targets) and the inputs. That is in case we have a simple one-dimensional input and output case
In the more general case the various inputs should be replaced by the possible features represented by the input data set \(\boldsymbol{X}\). We can now rewrite the above probability as
It is a conditional probability (see below) and reads as the likelihood of a domain of events \(\boldsymbol{D}\) given a set of parameters \(\boldsymbol{\beta}\).
Maximum Likelihood Estimation (MLE)#
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is the most probable.
We will assume here that our events are given by the above Gaussian distribution and we will determine the optimal parameters \(\beta\) by maximizing the above PDF. However, computing the derivatives of a product function is cumbersome and can easily lead to overflow and/or underflowproblems, with potentials for loss of numerical precision.
In practice, it is more convenient to maximize the logarithm of the PDF because it is a monotonically increasing function of the argument. Alternatively, and this will be our option, we will minimize the negative of the logarithm since this is a monotonically decreasing function.
Note also that maximization/minimization of the logarithm of the PDF is equivalent to the maximization/minimization of the function itself.
A new Cost Function#
We could now define a new cost function to minimize, namely the negative logarithm of the above PDF
which becomes
Taking the derivative of the new cost function with respect to the parameters \(\beta\) we recognize our familiar OLS equation, namely
which leads to the well-known OLS equation for the optimal paramters \(\beta\)
Before we make a similar analysis for Ridge and Lasso regression, we need a short reminder on statistics.
More basic Statistics and Bayes’ theorem#
A central theorem in statistics is Bayes’ theorem. This theorem plays a similar role as the good old Pythagoras’ theorem in geometry. Bayes’ theorem is extremely simple to derive. But to do so we need some basic axioms from statistics.
Assume we have two domains of events \(X=[x_0,x_1,\dots,x_{n-1}]\) and \(Y=[y_0,y_1,\dots,y_{n-1}]\).
We define also the likelihood for \(X\) and \(Y\) as \(p(X)\) and \(p(Y)\) respectively. The likelihood of a specific event \(x_i\) (or \(y_i\)) is then written as \(p(X=x_i)\) or just \(p(x_i)=p_i\).
Union of events is given by.
The product rule (aka joint probability) is given by.
where we read \(p(X\vert Y)\) as the likelihood of obtaining \(X\) given \(Y\).
If we have independent events then \(p(X,Y)=p(X)p(Y)\).
Marginal Probability#
The marginal probability is defined in terms of only one of the set of variables \(X,Y\). For a discrete probability we have
Conditional Probability#
The conditional probability, if \(p(Y) > 0\), is
Bayes’ Theorem#
If we combine the conditional probability with the marginal probability and the standard product rule, we have
which we can rewrite as
which is Bayes’ theorem. It allows us to evaluate the uncertainty in in \(X\) after we have observed \(Y\). We can easily interchange \(X\) with \(Y\).
Interpretations of Bayes’ Theorem#
The quantity \(p(Y\vert X)\) on the right-hand side of the theorem is evaluated for the observed data \(Y\) and can be viewed as a function of the parameter space represented by \(X\). This function is not necesseraly normalized and is normally called the likelihood function.
The function \(p(X)\) on the right hand side is called the prior while the function on the left hand side is the called the posterior probability. The denominator on the right hand side serves as a normalization factor for the posterior distribution.
Let us try to illustrate Bayes’ theorem through an example.
Example of Usage of Bayes’ theorem#
Let us suppose that you are undergoing a series of mammography scans in order to rule out possible breast cancer cases. We define the sensitivity for a positive event by the variable \(X\). It takes binary values with \(X=1\) representing a positive event and \(X=0\) being a negative event. We reserve \(Y\) as a classification parameter for either a negative or a positive breast cancer confirmation. (Short note on wordings: positive here means having breast cancer, although none of us would consider this being a positive thing).
We let \(Y=1\) represent the the case of having breast cancer and \(Y=0\) as not.
Let us assume that if you have breast cancer, the test will be positive with a probability of \(0.8\), that is we have
This obviously sounds scary since many would conclude that if the test is positive, there is a likelihood of \(80\%\) for having cancer. It is however not correct, as the following Bayesian analysis shows.
Doing it correctly#
If we look at various national surveys on breast cancer, the general likelihood of developing breast cancer is a very small number. Let us assume that the prior probability in the population as a whole is
We need also to account for the fact that the test may produce a false positive result (false alarm). Let us here assume that we have
Using Bayes’ theorem we can then find the posterior probability that the person has breast cancer in case of a positive test, that is we can compute
That is, in case of a positive test, there is only a \(3\%\) chance of having breast cancer!
Bayes’ Theorem and Ridge and Lasso Regression#
Using Bayes’ theorem we can gain a better intuition about Ridge and Lasso regression.
For ordinary least squares we postulated that the maximum likelihood for the doamin of events \(\boldsymbol{D}\) (one-dimensional case)
is given by
In Bayes’ theorem this function plays the role of the so-called likelihood. We could now ask the question what is the posterior probability of a parameter set \(\boldsymbol{\beta}\) given a domain of events \(\boldsymbol{D}\)? That is, how can we define the posterior probability
Bayes’ theorem comes to our rescue here since (omitting the normalization constant)
We have a model for \(p(\boldsymbol{D}\vert\boldsymbol{\beta})\) but need one for the prior \(p(\boldsymbol{\beta})\)!
Ridge and Bayes#
With the posterior probability defined by a likelihood which we have already modeled and an unknown prior, we are now ready to make additional models for the prior.
We can, based on our discussions of the variance of \(\boldsymbol{\beta}\) and the mean value, assume that the prior for the values \(\boldsymbol{\beta}\) is given by a Gaussian with mean value zero and variance \(\tau^2\), that is
Our posterior probability becomes then (omitting the normalization factor which is just a constant)
We can now optimize this quantity with respect to \(\boldsymbol{\beta}\). As we did for OLS, this is most conveniently done by taking the negative logarithm of the posterior probability. Doing so and leaving out the constants terms that do not depend on \(\beta\), we have
and replacing \(1/2\tau^2\) with \(\lambda\) we have
which is our Ridge cost function! Nice, isn’t it?
Lasso and Bayes#
To derive the Lasso cost function, we simply replace the Gaussian prior with an exponential distribution (Laplace in this case) with zero mean value, that is
Our posterior probability becomes then (omitting the normalization factor which is just a constant)
Taking the negative logarithm of the posterior probability and leaving out the constants terms that do not depend on \(\beta\), we have
and replacing \(1/\tau\) with \(\lambda\) we have
which is our Lasso cost function!
Why resampling methods#
Before we proceed, we need to rethink what we have been doing. In our eager to fit the data, we have omitted several important elements in our regression analysis. In what follows we will
look at statistical properties, including a discussion of mean values, variance and the so-called bias-variance tradeoff
introduce resampling techniques like cross-validation, bootstrapping and jackknife and more
and discuss how to select a given model (one of the difficult parts in machine learning).
Resampling methods#
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
Two resampling methods are often used in Machine Learning analyses,
The bootstrap method
and Cross-Validation
In addition there are several other methods such as the Jackknife and the Blocking methods. We will discuss in particular cross-validation and the bootstrap method.
Resampling approaches can be computationally expensive#
Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In this chapter, we discuss two of the most commonly used resampling methods, cross-validation and the bootstrap. Both methods are important tools in the practical application of many statistical learning procedures. For example, cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection. The bootstrap is widely used.
Why resampling methods ?#
Statistical analysis.
Our simulations can be treated as computer experiments. This is particularly the case for Monte Carlo methods which are widely used in statistical analyses.
The results can be analysed with the same statistical tools as we would use when analysing experimental data.
As in all experiments, we are looking for expectation values and an estimate of how accurate they are, i.e., possible sources for errors.
Statistical analysis#
As in other experiments, many numerical experiments have two classes of errors:
Statistical errors
Systematical errors
Statistical errors can be estimated using standard tools from statistics
Systematical errors are method specific and must be treated differently from case to case.
Resampling methods#
With all these analytical equations for both the OLS and Ridge regression, we will now outline how to assess a given model. This will lead to a discussion of the so-called bias-variance tradeoff (see below) and so-called resampling methods.
One of the quantities we have discussed as a way to measure errors is the mean-squared error (MSE), mainly used for fitting of continuous functions. Another choice is the absolute error.
In the discussions below we will focus on the MSE and in particular since we will split the data into test and training data, we discuss the
prediction error or simply the test error \(\mathrm{Err_{Test}}\), where we have a fixed training set and the test error is the MSE arising from the data reserved for testing. We discuss also the
training error \(\mathrm{Err_{Train}}\), which is the average loss over the training data.
As our model becomes more and more complex, more of the training data tends to used. The training may thence adapt to more complicated structures in the data. This may lead to a decrease in the bias (see below for code example) and a slight increase of the variance for the test error. For a certain level of complexity the test error will reach minimum, before starting to increase again. The training error reaches a saturation.
Resampling methods: Bootstrap#
Bootstrapping is a non-parametric approach to statistical inference that substitutes computation for more traditional distributional assumptions and asymptotic results. Bootstrapping offers a number of advantages:
The bootstrap is quite general, although there are some cases in which it fails.
Because it does not require distributional assumptions (such as normally distributed errors), the bootstrap can provide more accurate inferences when the data are not well behaved or when the sample size is small.
It is possible to apply the bootstrap to statistics with sampling distributions that are difficult to derive, even asymptotically.
It is relatively simple to apply the bootstrap to complex data-collection plans (such as stratified and clustered samples).
The textbook by Davison on the Bootstrap Methods and their Applications provides many more insights and proofs. In this course we will take a more practical approach and use the results and theorems provided in the literature. For those interested in reading more about the bootstrap methods, we recommend the above text and the one by Efron and Tibshirani.
Before we proceed however, we need to remind ourselves about a central theorem in statistics, namely the so-called central limit theorem.
The Central Limit Theorem#
Suppose we have a PDF \(p(x)\) from which we generate a series \(N\) of averages \(\mathbb{E}[x_i]\). Each mean value \(\mathbb{E}[x_i]\) is viewed as the average of a specific measurement, e.g., throwing dice 100 times and then taking the average value, or producing a certain amount of random numbers. For notational ease, we set \(\mathbb{E}[x_i]=x_i\) in the discussion which follows. We do the same for \(\mathbb{E}[z]=z\).
If we compute the mean \(z\) of \(m\) such mean values \(x_i\)
the question we pose is which is the PDF of the new variable \(z\).
Finding the Limit#
The probability of obtaining an average value \(z\) is the product of the probabilities of obtaining arbitrary individual mean values \(x_i\), but with the constraint that the average is \(z\). We can express this through the following expression
where the \(\delta\)-function enbodies the constraint that the mean is \(z\). All measurements that lead to each individual \(x_i\) are expected to be independent, which in turn means that we can express \(\tilde{p}\) as the product of individual \(p(x_i)\). The independence assumption is important in the derivation of the central limit theorem.
Rewriting the \(\delta\)-function#
If we use the integral expression for the \(\delta\)-function
and inserting \(e^{i\mu q-i\mu q}\) where \(\mu\) is the mean value we arrive at
with the integral over \(x\) resulting in
Identifying Terms#
The second term on the rhs disappears since this is just the mean and employing the definition of \(\sigma^2\) we have
resulting in
and in the limit \(m\rightarrow \infty\) we obtain
which is the normal distribution with variance \(\sigma^2_m=\sigma^2/m\), where \(\sigma\) is the variance of the PDF \(p(x)\) and \(\mu\) is also the mean of the PDF \(p(x)\).
Wrapping it up#
Thus, the central limit theorem states that the PDF \(\tilde{p}(z)\) of the average of \(m\) random values corresponding to a PDF \(p(x)\) is a normal distribution whose mean is the mean value of the PDF \(p(x)\) and whose variance is the variance of the PDF \(p(x)\) divided by \(m\), the number of values used to compute \(z\).
The central limit theorem leads to the well-known expression for the standard deviation, given by
The latter is true only if the average value is known exactly. This is obtained in the limit \(m\rightarrow \infty\) only. Because the mean and the variance are measured quantities we obtain the familiar expression in statistics (the so-called Bessel correction)
In many cases however the above estimate for the standard deviation, in particular if correlations are strong, may be too simplistic. Keep in mind that we have assumed that the variables \(x\) are independent and identically distributed. This is obviously not always the case. For example, the random numbers (or better pseudorandom numbers) we generate in various calculations do always exhibit some correlations.
The theorem is satisfied by a large class of PDFs. Note however that for a finite \(m\), it is not always possible to find a closed form /analytic expression for \(\tilde{p}(x)\).
Confidence Intervals#
Confidence intervals are used in statistics and represent a type of estimate computed from the observed data. This gives a range of values for an unknown parameter such as the parameters \(\boldsymbol{\beta}\) from linear regression.
With the OLS expressions for the parameters \(\boldsymbol{\beta}\) we found \(\mathbb{E}(\boldsymbol{\beta}) = \boldsymbol{\beta}\), which means that the estimator of the regression parameters is unbiased.
In the exercises this week we show that the variance of the estimate of the \(j\)-th regression coefficient is \(\boldsymbol{\sigma}^2 (\boldsymbol{\beta}_j ) = \boldsymbol{\sigma}^2 [(\mathbf{X}^{T} \mathbf{X})^{-1}]_{jj} \).
This quantity can be used to construct a confidence interval for the estimates.
Standard Approach based on the Normal Distribution#
We will assume that the parameters \(\beta\) follow a normal distribution. We can then define the confidence interval. Here we will be using as shorthands \(\mu_{\beta}\) for the above mean value and \(\sigma_{\beta}\) for the standard deviation. We have then a confidence interval
where \(z\) defines the level of certainty (or confidence). For a normal distribution typical parameters are \(z=2.576\) which corresponds to a confidence of \(99\%\) while \(z=1.96\) corresponds to a confidence of \(95\%\). A confidence level of \(95\%\) is commonly used and it is normally referred to as a two-sigmas confidence level, that is we approximate \(z\approx 2\).
For more discussions of confidence intervals (and in particular linked with a discussion of the bootstrap method), see chapter 5 of the textbook by Davison on the Bootstrap Methods and their Applications
In this text you will also find an in-depth discussion of the Bootstrap method, why it works and various theorems related to it.
Resampling methods: Bootstrap background#
Since \(\widehat{\beta} = \widehat{\beta}(\boldsymbol{X})\) is a function of random variables, \(\widehat{\beta}\) itself must be a random variable. Thus it has a pdf, call this function \(p(\boldsymbol{t})\). The aim of the bootstrap is to estimate \(p(\boldsymbol{t})\) by the relative frequency of \(\widehat{\beta}\). You can think of this as using a histogram in the place of \(p(\boldsymbol{t})\). If the relative frequency closely resembles \(p(\vec{t})\), then using numerics, it is straight forward to estimate all the interesting parameters of \(p(\boldsymbol{t})\) using point estimators.
Resampling methods: More Bootstrap background#
In the case that \(\widehat{\beta}\) has more than one component, and the components are independent, we use the same estimator on each component separately. If the probability density function of \(X_i\), \(p(x)\), had been known, then it would have been straightforward to do this by:
Drawing lots of numbers from \(p(x)\), suppose we call one such set of numbers \((X_1^*, X_2^*, \cdots, X_n^*)\).
Then using these numbers, we could compute a replica of \(\widehat{\beta}\) called \(\widehat{\beta}^*\).
By repeated use of the above two points, many estimates of \(\widehat{\beta}\) can be obtained. The idea is to use the relative frequency of \(\widehat{\beta}^*\) (think of a histogram) as an estimate of \(p(\boldsymbol{t})\).
Resampling methods: Bootstrap approach#
But unless there is enough information available about the process that generated \(X_1,X_2,\cdots,X_n\), \(p(x)\) is in general unknown. Therefore, Efron in 1979 asked the question: What if we replace \(p(x)\) by the relative frequency of the observation \(X_i\)?
If we draw observations in accordance with the relative frequency of the observations, will we obtain the same result in some asymptotic sense? The answer is yes.
Resampling methods: Bootstrap steps#
The independent bootstrap works like this:
Draw with replacement \(n\) numbers for the observed variables \(\boldsymbol{x} = (x_1,x_2,\cdots,x_n)\).
Define a vector \(\boldsymbol{x}^*\) containing the values which were drawn from \(\boldsymbol{x}\).
Using the vector \(\boldsymbol{x}^*\) compute \(\widehat{\beta}^*\) by evaluating \(\widehat \beta\) under the observations \(\boldsymbol{x}^*\).
Repeat this process \(k\) times.
When you are done, you can draw a histogram of the relative frequency of \(\widehat \beta^*\). This is your estimate of the probability distribution \(p(t)\). Using this probability distribution you can estimate any statistics thereof. In principle you never draw the histogram of the relative frequency of \(\widehat{\beta}^*\). Instead you use the estimators corresponding to the statistic of interest. For example, if you are interested in estimating the variance of \(\widehat \beta\), apply the etsimator \(\widehat \sigma^2\) to the values \(\widehat \beta^*\).
Code example for the Bootstrap method#
The following code starts with a Gaussian distribution with mean value \(\mu =100\) and variance \(\sigma=15\). We use this to generate the data used in the bootstrap analysis. The bootstrap analysis returns a data set after a given number of bootstrap operations (as many as we have data points). This data set consists of estimated mean values for each bootstrap operation. The histogram generated by the bootstrap method shows that the distribution for these mean values is also a Gaussian, centered around the mean value \(\mu=100\) but with standard deviation \(\sigma/\sqrt{n}\), where \(n\) is the number of bootstrap samples (in this case the same as the number of original data points). The value of the standard deviation is what we expect from the central limit theorem.
%matplotlib inline
import numpy as np
from time import time
from scipy.stats import norm
import matplotlib.pyplot as plt
# Returns mean of bootstrap samples
# Bootstrap algorithm
def bootstrap(data, datapoints):
t = np.zeros(datapoints)
n = len(data)
# non-parametric bootstrap
for i in range(datapoints):
t[i] = np.mean(data[np.random.randint(0,n,n)])
# analysis
print("Bootstrap Statistics :")
print("original bias std. error")
print("%8g %8g %14g %15g" % (np.mean(data), np.std(data),np.mean(t),np.std(t)))
return t
# We set the mean value to 100 and the standard deviation to 15
mu, sigma = 100, 15
datapoints = 10000
# We generate random numbers according to the normal distribution
x = mu + sigma*np.random.randn(datapoints)
# bootstrap returns the data sample
t = bootstrap(x, datapoints)
Bootstrap Statistics :
original bias std. error
100.088 15.1355 100.087 0.151546
We see that our new variance and from that the standard deviation, agrees with the central limit theorem.
Plotting the Histogram#
# the histogram of the bootstrapped data (normalized data if density = True)
n, binsboot, patches = plt.hist(t, 50, density=True, facecolor='red', alpha=0.75)
# add a 'best fit' line
y = norm.pdf(binsboot, np.mean(t), np.std(t))
lt = plt.plot(binsboot, y, 'b', linewidth=1)
plt.xlabel('x')
plt.ylabel('Probability')
plt.grid(True)
plt.show()
The bias-variance tradeoff#
We will discuss the bias-variance tradeoff in the context of continuous predictions such as regression. However, many of the intuitions and ideas discussed here also carry over to classification tasks. Consider a dataset \(\mathcal{D}\) consisting of the data \(\mathbf{X}_\mathcal{D}=\{(y_j, \boldsymbol{x}_j), j=0\ldots n-1\}\).
Let us assume that the true data is generated from a noisy model
where \(\epsilon\) is normally distributed with mean zero and standard deviation \(\sigma^2\).
In our derivation of the ordinary least squares method we defined then an approximation to the function \(f\) in terms of the parameters \(\boldsymbol{\beta}\) and the design matrix \(\boldsymbol{X}\) which embody our model, that is \(\boldsymbol{\tilde{y}}=\boldsymbol{X}\boldsymbol{\beta}\).
Thereafter we found the parameters \(\boldsymbol{\beta}\) by optimizing the means squared error via the so-called cost function
We can rewrite this as
The three terms represent the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. The second term represents the variance of the chosen model and finally the last terms is variance of the error \(\boldsymbol{\epsilon}\).
To derive this equation, we need to recall that the variance of \(\boldsymbol{y}\) and \(\boldsymbol{\epsilon}\) are both equal to \(\sigma^2\). The mean value of \(\boldsymbol{\epsilon}\) is by definition equal to zero. Furthermore, the function \(f\) is not a stochastics variable, idem for \(\boldsymbol{\tilde{y}}\). We use a more compact notation in terms of the expectation value
and adding and subtracting \(\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\) we get
which, using the abovementioned expectation values can be rewritten as
that is the rewriting in terms of the so-called bias, the variance of the model \(\boldsymbol{\tilde{y}}\) and the variance of \(\boldsymbol{\epsilon}\).
A way to Read the Bias-Variance Tradeoff#
Figure 1:
Example code for Bias-Variance tradeoff#
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample
np.random.seed(2018)
n = 500
n_boostraps = 100
degree = 18 # A quite high value, just to show.
noise = 0.1
# Make data set.
x = np.linspace(-1, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2) + np.random.normal(0, 0.1, x.shape)
# Hold out some test data that is never used in training.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Combine x transformation and model into one operation.
# Not neccesary, but convenient.
model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression(fit_intercept=False))
# The following (m x n_bootstraps) matrix holds the column vectors y_pred
# for each bootstrap iteration.
y_pred = np.empty((y_test.shape[0], n_boostraps))
for i in range(n_boostraps):
x_, y_ = resample(x_train, y_train)
# Evaluate the new model on the same test data each time.
y_pred[:, i] = model.fit(x_, y_).predict(x_test).ravel()
# Note: Expectations and variances taken w.r.t. different training
# data sets, hence the axis=1. Subsequent means are taken across the test data
# set in order to obtain a total value, but before this we have error/bias/variance
# calculated per data point in the test set.
# Note 2: The use of keepdims=True is important in the calculation of bias as this
# maintains the column vector form. Dropping this yields very unexpected results.
error = np.mean( np.mean((y_test - y_pred)**2, axis=1, keepdims=True) )
bias = np.mean( (y_test - np.mean(y_pred, axis=1, keepdims=True))**2 )
variance = np.mean( np.var(y_pred, axis=1, keepdims=True) )
print('Error:', error)
print('Bias^2:', bias)
print('Var:', variance)
print('{} >= {} + {} = {}'.format(error, bias, variance, bias+variance))
plt.plot(x[::5, :], y[::5, :], label='f(x)')
plt.scatter(x_test, y_test, label='Data points')
plt.scatter(x_test, np.mean(y_pred, axis=1), label='Pred')
plt.legend()
plt.show()
Error: 0.013121574015585152
Bias^2: 0.012073649446193166
Var: 0.0010479245693919886
0.013121574015585152 >= 0.012073649446193166 + 0.0010479245693919886 = 0.013121574015585155
Understanding what happens#
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample
np.random.seed(2018)
n = 40
n_boostraps = 100
maxdegree = 14
# Make data set.
x = np.linspace(-3, 3, n).reshape(-1, 1)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2)+ np.random.normal(0, 0.1, x.shape)
error = np.zeros(maxdegree)
bias = np.zeros(maxdegree)
variance = np.zeros(maxdegree)
polydegree = np.zeros(maxdegree)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
for degree in range(maxdegree):
model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression(fit_intercept=False))
y_pred = np.empty((y_test.shape[0], n_boostraps))
for i in range(n_boostraps):
x_, y_ = resample(x_train, y_train)
y_pred[:, i] = model.fit(x_, y_).predict(x_test).ravel()
polydegree[degree] = degree
error[degree] = np.mean( np.mean((y_test - y_pred)**2, axis=1, keepdims=True) )
bias[degree] = np.mean( (y_test - np.mean(y_pred, axis=1, keepdims=True))**2 )
variance[degree] = np.mean( np.var(y_pred, axis=1, keepdims=True) )
print('Polynomial degree:', degree)
print('Error:', error[degree])
print('Bias^2:', bias[degree])
print('Var:', variance[degree])
print('{} >= {} + {} = {}'.format(error[degree], bias[degree], variance[degree], bias[degree]+variance[degree]))
plt.plot(polydegree, error, label='Error')
plt.plot(polydegree, bias, label='bias')
plt.plot(polydegree, variance, label='Variance')
plt.legend()
plt.show()
Polynomial degree: 0
Error: 0.32149601703519115
Bias^2: 0.3123314713548606
Var: 0.009164545680330616
0.32149601703519115 >= 0.3123314713548606 + 0.009164545680330616 = 0.3214960170351912
Polynomial degree: 1
Error: 0.08426840630693411
Bias^2: 0.0796891867672603
Var: 0.004579219539673834
0.08426840630693411 >= 0.0796891867672603 + 0.004579219539673834 = 0.08426840630693413
Polynomial degree: 2
Error: 0.10398646080125035
Bias^2: 0.1007711427354898
Var: 0.0032153180657605116
0.10398646080125035 >= 0.1007711427354898 + 0.0032153180657605116 = 0.10398646080125032
Polynomial degree: 3
Error: 0.06547790180152355
Bias^2: 0.06208238634231949
Var: 0.0033955154592040936
0.06547790180152355 >= 0.06208238634231949 + 0.0033955154592040936 = 0.06547790180152359
Polynomial degree: 4
Error: 0.06844519414009445
Bias^2: 0.06453579006728324
Var: 0.003909404072811226
0.06844519414009445 >= 0.06453579006728324 + 0.003909404072811226 = 0.06844519414009446
Polynomial degree: 5
Error: 0.05227921801205686
Bias^2: 0.0481872773043029
Var: 0.004091940707753939
0.05227921801205686 >= 0.0481872773043029 + 0.004091940707753939 = 0.052279218012056844
Polynomial degree: 6
Error: 0.037813671417389005
Bias^2: 0.033657685071527665
Var: 0.00415598634586135
0.037813671417389005 >= 0.033657685071527665 + 0.00415598634586135 = 0.03781367141738902
Polynomial degree: 7
Error: 0.02760977349102253
Bias^2: 0.022999498260366312
Var: 0.004610275230656212
0.02760977349102253 >= 0.022999498260366312 + 0.004610275230656212 = 0.027609773491022525
Polynomial degree: 8
Error: 0.017355848195593347
Bias^2: 0.010331721306655127
Var: 0.007024126888938232
0.017355848195593347 >= 0.010331721306655127 + 0.007024126888938232 = 0.01735584819559336
Polynomial degree: 9
Error: 0.02660572763718093
Bias^2: 0.010018312644137363
Var: 0.016587414993043573
0.02660572763718093 >= 0.010018312644137363 + 0.016587414993043573 = 0.026605727637180936
Polynomial degree: 10
Error: 0.021592704588025025
Bias^2: 0.010516485576645508
Var: 0.011076219011379514
0.021592704588025025 >= 0.010516485576645508 + 0.011076219011379514 = 0.021592704588025022
Polynomial degree: 11
Error: 0.07160048164233104
Bias^2: 0.014436800088904942
Var: 0.05716368155342608
0.07160048164233104 >= 0.014436800088904942 + 0.05716368155342608 = 0.07160048164233102
Polynomial degree: 12
Error: 0.11547777218872497
Bias^2: 0.01628578269596628
Var: 0.09919198949275869
0.11547777218872497 >= 0.01628578269596628 + 0.09919198949275869 = 0.11547777218872497
Polynomial degree: 13
Error: 0.22842468702219465
Bias^2: 0.01975416527185249
Var: 0.20867052175034223
0.22842468702219465 >= 0.01975416527185249 + 0.20867052175034223 = 0.2284246870221947
Summing up#
The bias-variance tradeoff summarizes the fundamental tension in machine learning, particularly supervised learning, between the complexity of a model and the amount of training data needed to train it. Since data is often limited, in practice it is often useful to use a less-complex model with higher bias, that is a model whose asymptotic performance is worse than another model because it is easier to train and less sensitive to sampling noise arising from having a finite-sized training dataset (smaller variance).
The above equations tell us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below \(Var(\epsilon)\), the irreducible error.
What do we mean by the variance and bias of a statistical learning method? The variance refers to the amount by which our model would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different estimate. But ideally the estimate for our model should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in the model. In general, more flexible statistical methods have higher variance.
You may also find this recent article of interest.
Another Example from Scikit-Learn’s Repository#
This example demonstrates the problems of underfitting and overfitting and how we can use linear regression with polynomial features to approximate nonlinear functions. The plot shows the function that we want to approximate, which is a part of the cosine function. In addition, the samples from the real function and the approximations of different models are displayed. The models have polynomial features of different degrees. We can see that a linear function (polynomial with degree 1) is not sufficient to fit the training samples. This is called underfitting. A polynomial of degree 4 approximates the true function almost perfectly. However, for higher degrees the model will overfit the training data, i.e. it learns the noise of the training data. We evaluate quantitatively overfitting and underfitting by using cross-validation. We calculate the mean squared error (MSE) on the validation set, the higher, the less likely the model generalizes correctly from the training data.
#print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
ax = plt.subplot(1, len(degrees), i + 1)
plt.setp(ax, xticks=(), yticks=())
polynomial_features = PolynomialFeatures(degree=degrees[i],
include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
("linear_regression", linear_regression)])
pipeline.fit(X[:, np.newaxis], y)
# Evaluate the models using crossvalidation
scores = cross_val_score(pipeline, X[:, np.newaxis], y,
scoring="neg_mean_squared_error", cv=10)
X_test = np.linspace(0, 1, 100)
plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
degrees[i], -scores.mean(), scores.std()))
plt.show()
Various steps in cross-validation#
When the repetitive splitting of the data set is done randomly, samples may accidently end up in a fast majority of the splits in either training or test set. Such samples may have an unbalanced influence on either model building or prediction evaluation. To avoid this \(k\)-fold cross-validation structures the data splitting. The samples are divided into \(k\) more or less equally sized exhaustive and mutually exclusive subsets. In turn (at each split) one of these subsets plays the role of the test set while the union of the remaining subsets constitutes the training set. Such a splitting warrants a balanced representation of each sample in both training and test set over the splits. Still the division into the \(k\) subsets involves a degree of randomness. This may be fully excluded when choosing \(k=n\). This particular case is referred to as leave-one-out cross-validation (LOOCV).
Cross-validation in brief#
For the various values of \(k\)
shuffle the dataset randomly.
Split the dataset into \(k\) groups.
For each unique group:
a. Decide which group to use as set for test data
b. Take the remaining groups as a training data set
c. Fit a model on the training set and evaluate it on the test set
d. Retain the evaluation score and discard the model
Summarize the model using the sample of model evaluation scores
Code Example for Cross-validation and \(k\)-fold Cross-validation#
The code here uses Ridge regression with cross-validation (CV) resampling and \(k\)-fold CV in order to fit a specific polynomial.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
# A seed just to ensure that the random numbers are the same for every run.
# Useful for eventual debugging.
np.random.seed(3155)
# Generate the data.
nsamples = 100
x = np.random.randn(nsamples)
y = 3*x**2 + np.random.randn(nsamples)
## Cross-validation on Ridge regression using KFold only
# Decide degree on polynomial to fit
poly = PolynomialFeatures(degree = 6)
# Decide which values of lambda to use
nlambdas = 500
lambdas = np.logspace(-3, 5, nlambdas)
# Initialize a KFold instance
k = 5
kfold = KFold(n_splits = k)
# Perform the cross-validation to estimate MSE
scores_KFold = np.zeros((nlambdas, k))
i = 0
for lmb in lambdas:
ridge = Ridge(alpha = lmb)
j = 0
for train_inds, test_inds in kfold.split(x):
xtrain = x[train_inds]
ytrain = y[train_inds]
xtest = x[test_inds]
ytest = y[test_inds]
Xtrain = poly.fit_transform(xtrain[:, np.newaxis])
ridge.fit(Xtrain, ytrain[:, np.newaxis])
Xtest = poly.fit_transform(xtest[:, np.newaxis])
ypred = ridge.predict(Xtest)
scores_KFold[i,j] = np.sum((ypred - ytest[:, np.newaxis])**2)/np.size(ypred)
j += 1
i += 1
estimated_mse_KFold = np.mean(scores_KFold, axis = 1)
## Cross-validation using cross_val_score from sklearn along with KFold
# kfold is an instance initialized above as:
# kfold = KFold(n_splits = k)
estimated_mse_sklearn = np.zeros(nlambdas)
i = 0
for lmb in lambdas:
ridge = Ridge(alpha = lmb)
X = poly.fit_transform(x[:, np.newaxis])
estimated_mse_folds = cross_val_score(ridge, X, y[:, np.newaxis], scoring='neg_mean_squared_error', cv=kfold)
# cross_val_score return an array containing the estimated negative mse for every fold.
# we have to the the mean of every array in order to get an estimate of the mse of the model
estimated_mse_sklearn[i] = np.mean(-estimated_mse_folds)
i += 1
## Plot and compare the slightly different ways to perform cross-validation
plt.figure()
plt.plot(np.log10(lambdas), estimated_mse_sklearn, label = 'cross_val_score')
plt.plot(np.log10(lambdas), estimated_mse_KFold, 'r--', label = 'KFold')
plt.xlabel('log10(lambda)')
plt.ylabel('mse')
plt.legend()
plt.show()
More examples on bootstrap and cross-validation and errors#
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import mean_squared_error
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("EoS.csv"),'r')
# Read the EoS data as csv file and organize the data into two arrays with density and energies
EoS = pd.read_csv(infile, names=('Density', 'Energy'))
EoS['Energy'] = pd.to_numeric(EoS['Energy'], errors='coerce')
EoS = EoS.dropna()
Energies = EoS['Energy']
Density = EoS['Density']
# The design matrix now as function of various polytrops
Maxpolydegree = 30
X = np.zeros((len(Density),Maxpolydegree))
X[:,0] = 1.0
testerror = np.zeros(Maxpolydegree)
trainingerror = np.zeros(Maxpolydegree)
polynomial = np.zeros(Maxpolydegree)
trials = 100
for polydegree in range(1, Maxpolydegree):
polynomial[polydegree] = polydegree
for degree in range(polydegree):
X[:,degree] = Density**(degree/3.0)
# loop over trials in order to estimate the expectation value of the MSE
testerror[polydegree] = 0.0
trainingerror[polydegree] = 0.0
for samples in range(trials):
x_train, x_test, y_train, y_test = train_test_split(X, Energies, test_size=0.2)
model = LinearRegression(fit_intercept=False).fit(x_train, y_train)
ypred = model.predict(x_train)
ytilde = model.predict(x_test)
testerror[polydegree] += mean_squared_error(y_test, ytilde)
trainingerror[polydegree] += mean_squared_error(y_train, ypred)
testerror[polydegree] /= trials
trainingerror[polydegree] /= trials
print("Degree of polynomial: %3d"% polynomial[polydegree])
print("Mean squared error on training data: %.8f" % trainingerror[polydegree])
print("Mean squared error on test data: %.8f" % testerror[polydegree])
plt.plot(polynomial, np.log10(trainingerror), label='Training Error')
plt.plot(polynomial, np.log10(testerror), label='Test Error')
plt.xlabel('Polynomial degree')
plt.ylabel('log10[MSE]')
plt.legend()
plt.show()
Degree of polynomial: 1
Mean squared error on training data: 446033.51374050
Mean squared error on test data: 455173.80460179
Degree of polynomial: 2
Mean squared error on training data: 114550.54637219
Mean squared error on test data: 129963.83146596
Degree of polynomial: 3
Mean squared error on training data: 9054.61775176
Mean squared error on test data: 10572.87627342
Degree of polynomial: 4
Mean squared error on training data: 302.15313054
Mean squared error on test data: 433.26292364
Degree of polynomial: 5
Mean squared error on training data: 3.64316192
Mean squared error on test data: 7.23528337
Degree of polynomial: 6
Mean squared error on training data: 3.56589683
Mean squared error on test data: 10.50427787
Degree of polynomial: 7
Mean squared error on training data: 0.47313680
Mean squared error on test data: 1.53738247
Degree of polynomial: 8
Mean squared error on training data: 0.04926746
Mean squared error on test data: 0.14629156
Degree of polynomial: 9
Mean squared error on training data: 0.02546675
Mean squared error on test data: 0.11202337
Degree of polynomial: 10
Mean squared error on training data: 0.02424794
Mean squared error on test data: 0.22467274
Degree of polynomial: 11
Mean squared error on training data: 0.01594452
Mean squared error on test data: 1.07641937
Degree of polynomial: 12
Mean squared error on training data: 0.00805074
Mean squared error on test data: 0.04295757
Degree of polynomial: 13
Mean squared error on training data: 0.00781918
Mean squared error on test data: 0.56965674
Degree of polynomial: 14
Mean squared error on training data: 0.00465099
Mean squared error on test data: 0.28443039
Degree of polynomial: 15
Mean squared error on training data: 0.00420072
Mean squared error on test data: 568.47051432
Degree of polynomial: 16
Mean squared error on training data: 0.00325450
Mean squared error on test data: 48.97630233
Degree of polynomial: 17
Mean squared error on training data: 0.00242954
Mean squared error on test data: 2.52780600
Degree of polynomial: 18
Mean squared error on training data: 0.00219195
Mean squared error on test data: 429.25695398
Degree of polynomial: 19
Mean squared error on training data: 0.00154853
Mean squared error on test data: 239.97065359
Degree of polynomial: 20
Mean squared error on training data: 0.00140846
Mean squared error on test data: 1350.24493666
Degree of polynomial: 21
Mean squared error on training data: 0.00119688
Mean squared error on test data: 1840.50530832
Degree of polynomial: 22
Mean squared error on training data: 0.00092898
Mean squared error on test data: 1184.60929685
Degree of polynomial: 23
Mean squared error on training data: 0.00089193
Mean squared error on test data: 3892.17483760
Degree of polynomial: 24
Mean squared error on training data: 0.00083355
Mean squared error on test data: 1332.46736215
Degree of polynomial: 25
Mean squared error on training data: 0.00079904
Mean squared error on test data: 7577.76690383
Degree of polynomial: 26
Mean squared error on training data: 0.00075590
Mean squared error on test data: 1079.36895644
Degree of polynomial: 27
Mean squared error on training data: 0.00068091
Mean squared error on test data: 3207.25343155
Degree of polynomial: 28
Mean squared error on training data: 0.00063362
Mean squared error on test data: 674.79633065
Degree of polynomial: 29
Mean squared error on training data: 0.00063866
Mean squared error on test data: 3099.60342978
/var/folders/td/3yk470mj5p931p9dtkk0y6jw0000gn/T/ipykernel_23293/626635268.py:73: RuntimeWarning: divide by zero encountered in log10
plt.plot(polynomial, np.log10(trainingerror), label='Training Error')
/var/folders/td/3yk470mj5p931p9dtkk0y6jw0000gn/T/ipykernel_23293/626635268.py:74: RuntimeWarning: divide by zero encountered in log10
plt.plot(polynomial, np.log10(testerror), label='Test Error')
Note that we kept the intercept column in the fitting here. This means that we need to set the intercept in the call to the Scikit-Learn function as False. Alternatively, we could have set up the design matrix \(X\) without the first column of ones.
The same example but now with cross-validation#
In this example we keep the intercept column again but add cross-validation in order to estimate the best possible value of the means squared error.
# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# Where to save the figures and data files
PROJECT_ROOT_DIR = "Results"
FIGURE_ID = "Results/FigureFiles"
DATA_ID = "DataFiles/"
if not os.path.exists(PROJECT_ROOT_DIR):
os.mkdir(PROJECT_ROOT_DIR)
if not os.path.exists(FIGURE_ID):
os.makedirs(FIGURE_ID)
if not os.path.exists(DATA_ID):
os.makedirs(DATA_ID)
def image_path(fig_id):
return os.path.join(FIGURE_ID, fig_id)
def data_path(dat_id):
return os.path.join(DATA_ID, dat_id)
def save_fig(fig_id):
plt.savefig(image_path(fig_id) + ".png", format='png')
infile = open(data_path("EoS.csv"),'r')
# Read the EoS data as csv file and organize the data into two arrays with density and energies
EoS = pd.read_csv(infile, names=('Density', 'Energy'))
EoS['Energy'] = pd.to_numeric(EoS['Energy'], errors='coerce')
EoS = EoS.dropna()
Energies = EoS['Energy']
Density = EoS['Density']
# The design matrix now as function of various polytrops
Maxpolydegree = 30
X = np.zeros((len(Density),Maxpolydegree))
X[:,0] = 1.0
estimated_mse_sklearn = np.zeros(Maxpolydegree)
polynomial = np.zeros(Maxpolydegree)
k =5
kfold = KFold(n_splits = k)
for polydegree in range(1, Maxpolydegree):
polynomial[polydegree] = polydegree
for degree in range(polydegree):
X[:,degree] = Density**(degree/3.0)
OLS = LinearRegression(fit_intercept=False)
# loop over trials in order to estimate the expectation value of the MSE
estimated_mse_folds = cross_val_score(OLS, X, Energies, scoring='neg_mean_squared_error', cv=kfold)
#[:, np.newaxis]
estimated_mse_sklearn[polydegree] = np.mean(-estimated_mse_folds)
plt.plot(polynomial, np.log10(estimated_mse_sklearn), label='Test Error')
plt.xlabel('Polynomial degree')
plt.ylabel('log10[MSE]')
plt.legend()
plt.show()
/var/folders/td/3yk470mj5p931p9dtkk0y6jw0000gn/T/ipykernel_23293/3817475779.py:63: RuntimeWarning: divide by zero encountered in log10
plt.plot(polynomial, np.log10(estimated_mse_sklearn), label='Test Error')
Material for the lab sessions#
Linking the regression analysis with a statistical interpretation#
We will now couple the discussions of ordinary least squares, Ridge and Lasso regression with a statistical interpretation, that is we move from a linear algebra analysis to a statistical analysis. In particular, we will focus on what the regularization terms can result in. We will amongst other things show that the regularization parameter can reduce considerably the variance of the parameters \(\beta\).
The
advantage of doing linear regression is that we actually end up with
analytical expressions for several statistical quantities.
Standard least squares and Ridge regression allow us to
derive quantities like the variance and other expectation values in a
rather straightforward way.
It is assumed that \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) and the \(\varepsilon_{i}\) are independent, i.e.:
The randomness of \(\varepsilon_i\) implies that \(\mathbf{y}_i\) is also a random variable. In particular, \(\mathbf{y}_i\) is normally distributed, because \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\) and \(\mathbf{X}_{i,\ast} \, \boldsymbol{\beta}\) is a non-random scalar. To specify the parameters of the distribution of \(\mathbf{y}_i\) we need to calculate its first two moments.
Recall that \(\boldsymbol{X}\) is a matrix of dimensionality \(n\times p\). The notation above \(\mathbf{X}_{i,\ast}\) means that we are looking at the row number \(i\) and perform a sum over all values \(p\).
Assumptions made#
The assumption we have made here can be summarized as (and this is going to be useful when we discuss the bias-variance trade off) that there exists a function \(f(\boldsymbol{x})\) and a normal distributed error \(\boldsymbol{\varepsilon}\sim \mathcal{N}(0, \sigma^2)\) which describe our data
We approximate this function with our model from the solution of the linear regression equations, that is our function \(f\) is approximated by \(\boldsymbol{\tilde{y}}\) where we want to minimize \((\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\), our MSE, with
Expectation value and variance#
We can calculate the expectation value of \(\boldsymbol{y}\) for a given element \(i\)
while its variance is
Hence, \(y_i \sim \mathcal{N}( \mathbf{X}_{i, \ast} \, \boldsymbol{\beta}, \sigma^2)\), that is \(\boldsymbol{y}\) follows a normal distribution with mean value \(\boldsymbol{X}\boldsymbol{\beta}\) and variance \(\sigma^2\) (not be confused with the singular values of the SVD).
Expectation value and variance for \(\boldsymbol{\beta}\)#
With the OLS expressions for the optimal parameters \(\boldsymbol{\hat{\beta}}\) we can evaluate the expectation value
This means that the estimator of the regression parameters is unbiased.
We can also calculate the variance
The variance of the optimal value \(\boldsymbol{\hat{\beta}}\) is
where we have used that \(\mathbb{E} (\mathbf{y} \mathbf{y}^{T}) = \mathbf{X} \, \boldsymbol{\beta} \, \boldsymbol{\beta}^{T} \, \mathbf{X}^{T} + \sigma^2 \, \mathbf{I}_{nn}\). From \(\mbox{Var}(\boldsymbol{\beta}) = \sigma^2 \, (\mathbf{X}^{T} \mathbf{X})^{-1}\), one obtains an estimate of the variance of the estimate of the \(j\)-th regression coefficient: \(\boldsymbol{\sigma}^2 (\boldsymbol{\beta}_j ) = \boldsymbol{\sigma}^2 [(\mathbf{X}^{T} \mathbf{X})^{-1}]_{jj} \). This may be used to construct a confidence interval for the estimates.
In a similar way, we can obtain analytical expressions for say the expectation values of the parameters \(\boldsymbol{\beta}\) and their variance when we employ Ridge regression, allowing us again to define a confidence interval.
It is rather straightforward to show that
We see clearly that \(\mathbb{E} \big[ \hat{\boldsymbol{\beta}}^{\mathrm{Ridge}} \big] \not= \hat{\boldsymbol{\beta}}^{\mathrm{OLS}}\) for any \(\lambda > 0\).
We can also compute the variance as
and it is easy to see that if the parameter \(\lambda\) goes to infinity then the variance of Ridge parameters \(\boldsymbol{\beta}\) goes to zero.
With this, we can compute the difference
The difference is non-negative definite since each component of the matrix product is non-negative definite. This means the variance we obtain with the standard OLS will always for \(\lambda > 0\) be larger than the variance of \(\boldsymbol{\beta}\) obtained with the Ridge estimator. This has interesting consequences when we discuss the so-called bias-variance trade-off below.
For more discussions of Ridge regression and calculation of averages, Wessel van Wieringen’s article is highly recommended.