Exercises week 38

September 15-19

Resampling and the Bias-Variance Trade-off

Learning goals

After completing these exercises, you will know how to

  • Derive expectation values and variances related to linear regression

  • Compute expectation values and variances related to linear regression

  • Compute and evaluate the trade-off between bias and variance of a model

Deliverables

Complete the following exercises while working in a Jupyter notebook. Then, in Canvas, include

Use the books!

This week deals with various mean values and variances in linear regression methods (here it may be useful to look up chapter 3, equation (3.8) of Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, The Elements of Statistical Learning, Springer).

For more discussions on Ridge regression and calculation of expectation values, Wessel van Wieringen’s article is highly recommended.

The exercises this week are also a part of project 1 and can be reused in the theory part of the project.

Definitions

We assume that there exists a continuous function \(f(\boldsymbol{x})\) and a normally distributed error \(\boldsymbol{\varepsilon}\sim N(0, \sigma^2)\) which describes our data

\[ \boldsymbol{y} = f(\boldsymbol{x})+\boldsymbol{\varepsilon} \]

We further assume that this continuous function can be modeled with a linear model \(\boldsymbol{\tilde{y}}\) of some features \(\boldsymbol{X}\).

\[ \boldsymbol{y} = \boldsymbol{\tilde{y}} + \boldsymbol{\varepsilon} = \boldsymbol{X}\boldsymbol{\beta} +\boldsymbol{\varepsilon} \]

We therefore get that our data \(\boldsymbol{y}\) has an expectation value \(\boldsymbol{X}\boldsymbol{\beta}\) and variance \(\sigma^2\), that is \(\boldsymbol{y}\) follows a normal distribution with mean value \(\boldsymbol{X}\boldsymbol{\beta}\) and variance \(\sigma^2\).
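
As a concrete illustration of this setup, the short sketch below draws one such data set; the particular feature matrix, true parameters and noise level are arbitrary choices made only for illustration.

import numpy as np

rng = np.random.default_rng(2024)

n = 100
x = np.linspace(-3, 3, n)

# A feature matrix with intercept, linear and quadratic columns (an arbitrary choice)
X = np.column_stack([np.ones(n), x, x**2])
beta_true = np.array([1.0, -0.5, 0.3])
sigma = 0.5

# y = X beta + eps, with eps ~ N(0, sigma^2)
y = X @ beta_true + rng.normal(0, sigma, n)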

Exercise 1: Expectation values for ordinary least squares expressions

a) With the expressions for the optimal parameters \(\boldsymbol{\hat{\beta}_{OLS}}\), show that

\[ \mathbb{E}(\boldsymbol{\hat{\beta}_{OLS}}) = \boldsymbol{\beta}. \]

b) Show that the variance of \(\boldsymbol{\hat{\beta}_{OLS}}\) is

\[ \mathbf{Var}(\boldsymbol{\hat{\beta}_{OLS}}) = \sigma^2 \, (\mathbf{X}^{T} \mathbf{X})^{-1}. \]

We can use the last expression when we define a confidence interval for the parameters \(\boldsymbol{\hat{\beta}_{OLS}}\). The variance of a given parameter \(\hat{\beta}_j\) is given by the corresponding diagonal element of the above matrix.
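
A Monte Carlo experiment is a useful sanity check of both results. The sketch below (reusing the synthetic setup from the Definitions section; the number of repetitions is arbitrary) refits OLS on many data sets drawn from the same model and compares the sample mean and covariance of the estimates with the analytical expressions.

import numpy as np

rng = np.random.default_rng(2024)

n, p, sigma = 100, 3, 0.5
x = np.linspace(-3, 3, n)
X = np.column_stack([np.ones(n), x, x**2])
beta_true = np.array([1.0, -0.5, 0.3])

XtX_inv = np.linalg.inv(X.T @ X)

# Draw many data sets from the same model and refit OLS on each of them
betas = np.empty((5000, p))
for i in range(betas.shape[0]):
    y = X @ beta_true + rng.normal(0, sigma, n)
    betas[i] = XtX_inv @ X.T @ y

print(betas.mean(axis=0))   # should be close to beta_true
print(np.cov(betas.T))      # should be close to sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)   # the diagonal gives the variance of each parameter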

Exercise 2: Expectation values for Ridge regression

a) With the expressions for the optimal parameters \(\hat{\boldsymbol{\beta}}^{\mathrm{Ridge}}\), show that

\[ \mathbb{E} \big[ \hat{\boldsymbol{\beta}}^{\mathrm{Ridge}} \big]=(\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I})^{-1} (\mathbf{X}^{T} \mathbf{X})\boldsymbol{\beta}. \]

We see that \(\mathbb{E} \big[ \hat{\boldsymbol{\beta}}^{\mathrm{Ridge}} \big] \not= \mathbb{E} \big[\hat{\boldsymbol{\beta}}^{\mathrm{OLS}}\big ]\) for any \(\lambda > 0\).

b) Show that the variance is

\[ \mathbf{Var}[\hat{\boldsymbol{\beta}}^{\mathrm{Ridge}}]=\sigma^2 [\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}]^{-1} \mathbf{X}^{T}\mathbf{X} \big\{[\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I}]^{-1}\big\}^{T}. \]

We see that if the parameter \(\lambda\) goes to infinity, then the variance of the Ridge estimator \(\hat{\boldsymbol{\beta}}^{\mathrm{Ridge}}\) goes to zero.
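
The same kind of Monte Carlo check as in Exercise 1 works here as well; a sketch, with the penalty \(\lambda\) chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(2024)

n, p, sigma, lam = 100, 3, 0.5, 10.0
x = np.linspace(-3, 3, n)
X = np.column_stack([np.ones(n), x, x**2])
beta_true = np.array([1.0, -0.5, 0.3])

A = np.linalg.inv(X.T @ X + lam * np.eye(p))  # (X^T X + lambda I)^{-1}

betas = np.empty((5000, p))
for i in range(betas.shape[0]):
    y = X @ beta_true + rng.normal(0, sigma, n)
    betas[i] = A @ X.T @ y

print(betas.mean(axis=0))            # close to A (X^T X) beta_true, not to beta_true
print(A @ X.T @ X @ beta_true)
print(np.cov(betas.T))               # close to sigma^2 A (X^T X) A^T
print(sigma**2 * A @ X.T @ X @ A.T)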

Exercise 3: Deriving the expression for the Bias-Variance Trade-off

The aim of this exercise is to derive the equations for the bias-variance tradeoff to be used in project 1.

The parameters \(\boldsymbol{\hat{\beta}_{OLS}}\) are found by optimizing the mean squared error via the so-called cost function

\[ C(\boldsymbol{X},\boldsymbol{\beta}) =\frac{1}{n}\sum_{i=0}^{n-1}(y_i-\tilde{y}_i)^2=\mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right] \]

a) Show that you can rewrite this into an expression which contains

  • the variance of the model (the variance term)

  • the expected deviation of the mean of the model from the true data (the bias term)

  • the variance of the noise

In other words, show that:

\[ \mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right]=\mathrm{Bias}[\tilde{y}]+\mathrm{var}[\tilde{y}]+\sigma^2, \]

with

\[ \mathrm{Bias}[\tilde{y}]=\mathbb{E}\left[\left(\boldsymbol{y}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\right)^2\right], \]

and

\[ \mathrm{var}[\tilde{y}]=\mathbb{E}\left[\left(\tilde{\boldsymbol{y}}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\right)^2\right]=\frac{1}{n}\sum_i(\tilde{y}_i-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right])^2. \]

In order to arrive at the equation for the bias, we have to approximate the unknown function \(f\) with the output/target values \(y\).
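
If you get stuck, a standard starting point (only a hint, not the full derivation) is to insert \(\boldsymbol{y} = f(\boldsymbol{x}) + \boldsymbol{\varepsilon}\) and to add and subtract \(\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]\) inside the square,

\[ \mathbb{E}\left[(\boldsymbol{y}-\boldsymbol{\tilde{y}})^2\right] = \mathbb{E}\left[\left(f(\boldsymbol{x})+\boldsymbol{\varepsilon}-\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]+\mathbb{E}\left[\boldsymbol{\tilde{y}}\right]-\boldsymbol{\tilde{y}}\right)^2\right], \]

and then argue that the cross terms vanish because \(\mathbb{E}[\boldsymbol{\varepsilon}]=0\), \(\mathbb{E}\left[\boldsymbol{\tilde{y}}-\mathbb{E}[\boldsymbol{\tilde{y}}]\right]=0\), and \(\boldsymbol{\varepsilon}\) is independent of \(\boldsymbol{\tilde{y}}\).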

b) Explain what the terms mean and discuss their interpretations.

Exercise 4: Computing the Bias and Variance

Before you compute the bias and variance of a real model for different complexities, let’s for now assume that you have sampled predictions and targets for a single model complexity using bootstrap resampling.

a) Using the expressions above, compute the mean squared error, bias and variance of the given data. Check that the sum of the bias and variance is (approximately) equal to the mean squared error.

import numpy as np

n = 100
bootstraps = 1000

predictions = np.random.rand(bootstraps, n) * 10 + 10  # shape (bootstraps, n)
targets = np.random.rand(1, n)  # shape (1, n), broadcasts against predictions

mse = ...
bias = ...
variance = ...
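
One possible way of filling in the blanks, following the definitions from Exercise 3 (the bias and variance are averaged over the bootstrap axis first and then over the data points), is sketched below.

mse = np.mean((targets - predictions) ** 2)
bias = np.mean((targets - np.mean(predictions, axis=0, keepdims=True)) ** 2)
variance = np.mean(np.var(predictions, axis=0))

print(f"MSE: {mse:.3f}, bias: {bias:.3f}, variance: {variance:.3f}")
print(f"bias + variance: {bias + variance:.3f}")  # should be (approximately) the MSE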

b) Change the prediction values in some way to increase the bias while decreasing the variance.

c) Change the prediction values in some way to increase the variance while decreasing the bias.
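
As an illustration of what b) and c) are asking for (these are only two of many possibilities, reusing the arrays defined above), replacing the predictions by a constant far from the targets removes all variance but increases the bias, whereas scattering the predictions widely around the targets does the opposite:

# b) higher bias, lower variance: every bootstrap predicts the same constant
predictions_b = np.full((bootstraps, n), 30.0)

# c) higher variance, lower bias: noisy predictions centred on the targets
predictions_c = targets + np.random.rand(bootstraps, n) * 40 - 20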

d) Perform a bias-variance analysis of a polynomial OLS model fit to a one-dimensional function by computing and plotting the bias and variance as functions of the polynomial degree of your model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import (
    PolynomialFeatures,
)  # use the fit_transform method of the created object!
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
n = 100
bootstraps = 1000

x = np.linspace(-3, 3, n)
y = np.exp(-(x**2)) + 1.5 * np.exp(-((x - 2) ** 2)) + np.random.normal(0, 0.1, n)  # noise added to every data point

biases = []
variances = []
mses = []

# for p in range(1, 5):
#    predictions = ...
#    targets = ...
#
#    X = ...
#    X_train, X_test, y_train, y_test = ...
#    for b in range(bootstraps):
#        X_train_re, y_train_re = ...
#
#        # fit your model on the sampled data
#
#        # make predictions on the test data
#        predictions[b, :] =
#        targets[b, :] =
#
#    biases.append(...)
#    variances.append(...)
#    mses.append(...)
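
One way of completing the loop above is sketched below; it continues from the setup already defined in this cell, and the maximum degree, the test split and the use of LinearRegression are arbitrary choices that can be varied.

max_degree = 12  # an arbitrary upper limit, chosen for illustration

for p in range(1, max_degree + 1):
    X = PolynomialFeatures(degree=p).fit_transform(x.reshape(-1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    predictions = np.empty((bootstraps, y_test.shape[0]))
    targets = np.empty((bootstraps, y_test.shape[0]))

    for b in range(bootstraps):
        X_train_re, y_train_re = resample(X_train, y_train)

        # fit the model on the resampled training data
        # (PolynomialFeatures already provides the intercept column)
        model = LinearRegression(fit_intercept=False).fit(X_train_re, y_train_re)

        # make predictions on the (fixed) test data
        predictions[b, :] = model.predict(X_test)
        targets[b, :] = y_test

    biases.append(np.mean((targets - np.mean(predictions, axis=0, keepdims=True)) ** 2))
    variances.append(np.mean(np.var(predictions, axis=0)))
    mses.append(np.mean((targets - predictions) ** 2))

degrees = range(1, max_degree + 1)
plt.plot(degrees, mses, label="MSE")
plt.plot(degrees, biases, label="bias")
plt.plot(degrees, variances, label="variance")
plt.xlabel("polynomial degree")
plt.legend()
plt.show()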

e) Discuss the bias-variance trade-off as a function of your model complexity (the degree of the polynomial).

f) Compute and discuss the bias and variance as functions of the number of data points (choose a suitable polynomial degree to show something interesting).

Exercise 5: Interpretation of scaling and metrics

In this course, we often ask you to scale data and compute various metrics. Although these practices are “standard” in the field, we will require you to demonstrate an understanding of why you need to scale data and use these metrics, both so that you can make better arguments about your results and so that you will hopefully make fewer mistakes.

First, a few reminders: In this course you should always scale the columns of the feature matrix, and sometimes scale the target data, when it is worth the effort. By scaling, we mean subtracting the mean and dividing by the standard deviation, though there are many other ways to scale data. When scaling either the feature matrix or the target data, the intercept becomes a bit harder to implement and understand, so take care.
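
As a reminder of what standardisation looks like in code, here is a minimal sketch using scikit-learn's StandardScaler on a made-up feature matrix; the key point is that the mean and standard deviation are computed on the training data only and then reused on the test data.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2024)
X_train = rng.normal(size=(80, 3)) * [1, 10, 100]  # columns on very different scales
X_test = rng.normal(size=(20, 3)) * [1, 10, 100]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # the same mean and std reused on the test data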

Briefly answer the following:

a) Why do we scale data?

b) Why does the OLS method give practically equivalent models on scaled and unscaled data?

c) Why does the Ridge method not give practically equivalent models on scaled and unscaled data? Why do we only consider the model on scaled data correct?

d) Why do we say that the Ridge method gives a biased model?

e) Is the MSE of the OLS method affected by scaling of the feature matrix? Is it affected by scaling of the target data?

f) Read about the R2 score, a metric we will ask you to use a lot later in the course. Is the R2 score of the OLS method affected by scaling of the feature matrix? Is it affected by scaling of the target data?

g) Give interpretations of the following R2 scores: 0, 0.5, 1.

h) What is an advantage of the R2 score over the MSE?
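
If you prefer to experiment rather than only reason about e), f) and h), a small sketch along these lines (the data and the scale factor are arbitrary) lets you compare how the MSE and the R2 score react when the target is rescaled and the model refit:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(2024)
x = np.linspace(0, 1, 50).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(0, 0.2, 50)

for scale in (1.0, 1000.0):  # rescale the target and refit the model
    y_scaled = y * scale
    pred = LinearRegression().fit(x, y_scaled).predict(x)
    print(scale, mean_squared_error(y_scaled, pred), r2_score(y_scaled, pred))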