Exercises week 37

September 15-19

Resampling and the Bias-Variance Trade-off

Learning goals

After completing these exercises, you will know how to

  • Derive expectation values and variances related to linear regression

  • Compute expectation values and variances related to linear regression

  • Compute and evaluate the trade-off between bias and variance of a model

This week deals with various mean values and variances in linear regression methods (here it may be useful to look up chapter 3, equation (3.8) of Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, The Elements of Statistical Learning, Springer). The exercises are also a part of project 1 and can be reused in the theory part of the project.

For more discussions on Ridge regression and calculation of expectation values, Wessel van Wieringen’s article is highly recommended.

We assume that there exists a continuous function \(f(\boldsymbol{x})\) and a normally distributed error \(\boldsymbol{\varepsilon}\sim N(0, \sigma^2)\) which together describe our data

\[ \boldsymbol{y} = f(\boldsymbol{x})+\boldsymbol{\varepsilon} \]

We further assume that this continuous function can be approximated by a linear model \(\boldsymbol{\tilde{y}}\) built from a set of features \(\boldsymbol{X}\).

\[ \boldsymbol{y} = \boldsymbol{\tilde{y}} + \boldsymbol{\varepsilon} = \boldsymbol{X}\boldsymbol{\beta} +\boldsymbol{\varepsilon} \]

Our data \(\boldsymbol{y}\) therefore have expectation value \(\boldsymbol{X}\boldsymbol{\beta}\) and variance \(\sigma^2\); that is, each element \(y_i\) follows a normal distribution with mean \((\boldsymbol{X}\boldsymbol{\beta})_i\) and variance \(\sigma^2\).
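As a concrete illustration (a minimal sketch with an arbitrary choice of features and parameters, not part of the exercise text), data following this model can be generated as

import numpy as np

np.random.seed(2025)                          # arbitrary seed, for reproducibility
n = 100
x = np.linspace(-3, 3, n)
X = np.column_stack([np.ones(n), x, x**2])    # example design matrix with polynomial features
beta = np.array([1.0, -0.5, 0.3])             # example "true" parameters
sigma = 0.1
y = X @ beta + np.random.normal(0, sigma, n)  # y = X beta + eps, eps ~ N(0, sigma^2)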

Exercise 1: Expectation values for ordinary least squares expressions

a) Using the expression for the optimal parameters \(\boldsymbol{\hat{\beta}_{OLS}}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\boldsymbol{y}\), show that

\[ \mathbb{E}(\boldsymbol{\hat{\beta}_{OLS}}) = \boldsymbol{\beta}. \]

b) Show that the variance of \(\boldsymbol{\hat{\beta}_{OLS}}\) is

\[ \mathbf{Var}(\boldsymbol{\hat{\beta}_{OLS}}) = \sigma^2 \, (\mathbf{X}^{T} \mathbf{X})^{-1}. \]

We can use the last expression when we define a confidence interval for the parameters \(\boldsymbol{\hat{\beta}_{OLS}}\). The variance of a given parameter \(({\hat{\beta}_{OLS}})_j\) is given by the \(j\)-th diagonal element of the above matrix.
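As a rough sketch of how this can be used in practice (assuming the \(\mathbf{X}\), \(\boldsymbol{y}\) and n from the illustration above, and estimating \(\sigma^2\) from the residuals), approximate 95% confidence intervals follow from the diagonal of the covariance matrix:

import numpy as np

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y              # OLS estimate
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])    # unbiased estimate of the noise variance
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)           # Var(beta_hat) = sigma^2 (X^T X)^{-1}
std_beta = np.sqrt(np.diag(var_beta))                    # standard error of each parameter
ci_lower = beta_hat - 1.96 * std_beta                    # approximate 95% confidence interval
ci_upper = beta_hat + 1.96 * std_beta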

Exercise 2: Expectation values for Ridge regression

a) Using the expression for the optimal parameters \(\hat{\boldsymbol{\beta}}^{\mathrm{Ridge}}=(\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I}_{pp})^{-1}\mathbf{X}^{T}\boldsymbol{y}\), where \(\mathbf{I}_{pp}\) is the \(p\times p\) identity matrix, show that

\[ \mathbb{E} \big[ \hat{\boldsymbol{\beta}}^{\mathrm{Ridge}} \big]=(\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} (\mathbf{X}^{T} \mathbf{X})\boldsymbol{\beta} \]

We see that \(\mathbb{E} \big[ \hat{\boldsymbol{\beta}}^{\mathrm{Ridge}} \big] \not= \mathbb{E} \big[\hat{\boldsymbol{\beta}}^{\mathrm{OLS}}\big ]\) for any \(\lambda > 0\).

b) Why do we say that Ridge regression gives a biased estimate? Is this a problem?

c) Show that the variance is

\[ \mathbf{Var}[\hat{\boldsymbol{\beta}}^{\mathrm{Ridge}}]=\sigma^2[ \mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I} ]^{-1} \mathbf{X}^{T}\mathbf{X} \{ [ \mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I} ]^{-1}\}^{T} \]

We see that if the parameter \(\lambda\) goes to infinity then the variance of the Ridge estimates \(\hat{\boldsymbol{\beta}}^{\mathrm{Ridge}}\) goes to zero.
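A small numerical sanity check (a sketch, reusing the \(\mathbf{X}\), \(\boldsymbol{\beta}\) and \(\sigma\) from the data-generation example above) that evaluates the two expressions for increasing \(\lambda\):

import numpy as np

p = X.shape[1]
XtX = X.T @ X
for lam in [0.0, 1.0, 10.0, 100.0]:
    ridge_inv = np.linalg.inv(XtX + lam * np.eye(p))
    mean_ridge = ridge_inv @ XtX @ beta                    # E[beta_Ridge]; equals beta only for lam = 0
    var_ridge = sigma**2 * ridge_inv @ XtX @ ridge_inv.T   # Var[beta_Ridge]; shrinks as lam grows
    print(lam, mean_ridge, np.diag(var_ridge))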

Exercise 3: Deriving the expression for the Bias-Variance Trade-off

The aim of this exercise is to derive the equations for the bias-variance tradeoff to be used in project 1.

The parameters \(\boldsymbol{\hat{\beta}_{OLS}}\) are found by minimizing the mean squared error via the so-called cost function

\[ C(\boldsymbol{X},\boldsymbol{\beta}) =\frac{1}{n}\sum_{i=0}^{n-1}(y_i-\tilde{y}_i)^2=\mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right] \]

a) Show that you can rewrite this as a sum of three terms: a term which contains the variance of the model itself (the so-called variance term), a term which measures the squared deviation between the true data and the mean value of the model (the bias term), and finally the variance of the noise. Note that in order to evaluate the bias term, you will need to approximate the unknown function \(f\) with the data \(\boldsymbol{y}\). That is, show that

\[ \mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right]=\mathrm{Bias}[\tilde{y}]+\mathrm{var}[\tilde{y}]+\sigma^2, \]

with

\[ \mathrm{Bias}[\tilde{y}]=\mathbb{E}\left[\left(\boldsymbol{y}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\right)^2\right], \]

and

\[ \mathrm{var}[\tilde{y}]=\mathbb{E}\left[\left(\tilde{\boldsymbol{y}}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\right)^2\right]=\frac{1}{n}\sum_i(\tilde{y}_i-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right])^2. \]
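One standard route for the derivation is to write \(\boldsymbol{y}=f(\boldsymbol{x})+\boldsymbol{\varepsilon}\) and to add and subtract \(\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\) inside the square,

\[ \mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right]=\mathbb{E}\left[\left(f(\boldsymbol{x})+\boldsymbol{\varepsilon}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]+\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]-\boldsymbol{\tilde{y}}\right)^2\right]. \]

Expanding the square and using \(\mathbb{E}[\boldsymbol{\varepsilon}]=0\), \(\mathbb{E}\left[\boldsymbol{\tilde{y}}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\right]=0\) and the assumption that the noise is independent of the model prediction, the cross terms vanish and the three remaining terms are the bias, the variance and \(\sigma^2\).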

b) Explain what the terms mean and discuss their interpretations.

Exercise 4: Computing the Bias and Variance

Before you compute the bias and variance of a real model for different complexities, let’s for now assume that you have sampled predictions and targets for a single model complexity using bootstrap resampling.

a) Using the expression above, compute the mean squared error, bias and variance of the given data. Check that the sum of the bias and variance correctly gives (approximately) the mean squared error.

import numpy as np

n = 100
bootstraps = 1000

# Placeholder data standing in for the bootstrapped predictions of a model and the
# corresponding targets, one row per bootstrap sample
predictions = np.random.rand(bootstraps, n) * 10 + 10
targets = np.random.rand(bootstraps, n)

# Fill in: mean squared error, (squared) bias and variance of the predictions
mse = ...
bias = ...
variance = ...
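One possible way to fill in the blanks (a sketch that follows the definitions in Exercise 3, taking the expectation over bootstrap samples as the mean over axis 0):

mse = np.mean((targets - predictions) ** 2)                    # E[(y - y_tilde)^2]
mean_prediction = np.mean(predictions, axis=0, keepdims=True)  # E[y_tilde] for each data point
bias = np.mean((targets - mean_prediction) ** 2)               # E[(y - E[y_tilde])^2]
variance = np.mean((predictions - mean_prediction) ** 2)       # E[(y_tilde - E[y_tilde])^2]
print(mse, bias + variance)                                    # the two should agree approximately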

b) Change the prediction values in some way to increase the bias while decreasing the variance.

c) Change the prediction values in some way to increase the variance while decreasing the bias.

d) Perform a bias-variance analysis of a polynomial OLS model fit to a one-dimensional function by computing and plotting the bias and variance as functions of the polynomial degree of your model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures # use the fit_transform method of the created object!
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

n = 100
bootstraps = 1000

x = np.linspace(-3, 3, n)
# Draw one noise value per data point (note the size argument)
y = np.exp(-x**2) + 1.5 * np.exp(-(x-2)**2) + np.random.normal(0, 0.1, n)

biases = []
variances = []
mses = []

#for p in range(1, 5):
#    predictions = ...
#    targets = ...
#    for b in range(bootstraps):
#        x_sample, y_sample = ...
#        X = ...
#        X_train, X_test, y_train, y_test = ...
#
#        predictions[b, :] = 
#        targets[b, :] = 
#        
#    biases.append(...)
#    variances.append(...)
#    mses.append(...)
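One possible way to complete the skeleton (a sketch; the 80/20 split, the use of fit_intercept=False, and doing the train/test split once per degree while bootstrapping only the training data are choices made here, not requirements of the exercise):

for p in range(1, 5):
    # Polynomial design matrix for this degree (includes the intercept column)
    X = PolynomialFeatures(degree=p).fit_transform(x.reshape(-1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    predictions = np.zeros((bootstraps, len(y_test)))
    targets = np.zeros((bootstraps, len(y_test)))
    for b in range(bootstraps):
        # Resample the training data with replacement, fit, and predict on the fixed test set
        X_sample, y_sample = resample(X_train, y_train)
        model = LinearRegression(fit_intercept=False).fit(X_sample, y_sample)
        predictions[b, :] = model.predict(X_test)
        targets[b, :] = y_test

    mean_prediction = np.mean(predictions, axis=0, keepdims=True)
    biases.append(np.mean((targets - mean_prediction) ** 2))
    variances.append(np.mean((predictions - mean_prediction) ** 2))
    mses.append(np.mean((targets - predictions) ** 2))

plt.plot(range(1, 5), mses, label="MSE")
plt.plot(range(1, 5), biases, label="Bias$^2$")
plt.plot(range(1, 5), variances, label="Variance")
plt.xlabel("Polynomial degree")
plt.legend()
plt.show()

Keeping the test set fixed across bootstrap samples is what makes the variance term measure the spread of the predictions for the same test points.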

e) Discuss the bias-variance trade-off as a function of your model complexity (the degree of the polynomial).

f) Compute and discuss the bias and variance as functions of the number of data points (choose a suitable polynomial degree to show something interesting).