Exercises week 37#
Implementing gradient descent for Ridge and ordinary least squares regression
Date: September 8-12, 2025
Learning goals#
After completing these exercises you will:
Have your own code for the simplest gradient descent approach applied to ordinary least squares (OLS) and Ridge regression
Be able to compare the analytical expressions for OLS and Ridge regression with the gradient descent approach
Have explored the role of the learning rate in the gradient descent approach and of the hyperparameter \(\lambda\) in Ridge regression
Know how to scale your data properly
Simple one-dimensional second-order polynomial#
We start with a very simple function
defined for \(x\in [-2,2]\). You can add noise if you wish.
We are going to fit this function with a polynomial ansatz. The easiest thing is to set up a second-order polynomial and see if you can fit the above function. Feel free to play around with higher-order polynomials.
Exercise 1, scale your data#
Before fitting a regression model, it is good practice to normalize or standardize the features. This ensures all features are on a comparable scale, which is especially important when using regularization. Here we will perform standardization, scaling each feature to have mean 0 and standard deviation 1.
1a)#
Compute the mean and standard deviation of each column (feature) in your design/feature matrix \(\boldsymbol{X}\). Subtract the mean and divide by the standard deviation for each feature.
We will also center the target \(\boldsymbol{y}\) to mean \(0\). Centering \(\boldsymbol{y}\) (and each feature) means the model does not require a separate intercept term; the data is shifted such that the intercept is effectively 0. (In practice, one could include an intercept in the model and not penalize it, but here we simplify by centering.) Choose \(n=100\) data points and set up \(\boldsymbol{x}\), \(\boldsymbol{y}\) and the design matrix \(\boldsymbol{X}\).
# Standardize features (zero mean, unit variance for each feature)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_std[X_std == 0] = 1 # safeguard to avoid division by zero for constant features
X_norm = (X - X_mean) / X_std
# Center the target to zero mean (optional, to simplify intercept handling)
y_mean = ?
y_centered = ?
Fill in the necessary details. Do we need to center the \(y\)-values?
After this preprocessing, each column of \(\boldsymbol{X}_{\mathrm{norm}}\) has mean zero and standard deviation \(1\) and \(\boldsymbol{y}_{\mathrm{centered}}\) has mean 0. This makes the optimization landscape nicer and ensures the regularization penalty \(\lambda \sum_j \theta_j^2\) in Ridge regression treats each coefficient fairly (since features are on the same scale).
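As a hint, here is a minimal sketch of one possible completion. The second-order polynomial used for \(y\) below is only an illustrative placeholder; replace it with the function defined above. The design matrix omits the intercept column, since centering \(\boldsymbol{y}\) takes care of the intercept.
import numpy as np
n = 100
x = np.linspace(-2, 2, n)
# Placeholder target: an illustrative second-order polynomial plus a little noise
y = 2.0 + 3.0 * x - 4.0 * x**2 + 0.1 * np.random.randn(n)
# Design matrix for a second-order polynomial, without an intercept column
X = np.column_stack((x, x**2))
# Standardize features and center the target
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_std[X_std == 0] = 1
X_norm = (X - X_mean) / X_std
y_mean = y.mean()
y_centered = y - y_mean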
Exercise 2, calculate the gradients#
Find the gradients for OLS and Ridge regression using the mean-squared error as cost/loss function.
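As a check on your derivation: with the convention \(C_{\mathrm{OLS}}(\boldsymbol{\theta}) = \frac{1}{n}\lVert \boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\rVert^2\) and the Ridge penalty \(\lambda\lVert\boldsymbol{\theta}\rVert^2\) (other prefactor conventions are common and only rescale the learning rate and \(\lambda\)), you should arrive at expressions of the form
\[
\nabla_{\boldsymbol{\theta}} C_{\mathrm{OLS}} = -\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right),
\qquad
\nabla_{\boldsymbol{\theta}} C_{\mathrm{Ridge}} = -\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\right) + 2\lambda\boldsymbol{\theta}.
\]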
Exercise 3, using the analytical formulae for OLS and Ridge regression to find the optimal parameters \(\boldsymbol{\theta}\)#
# Set regularization parameter, either a single value or a vector of values
# Note that lambda is a python keyword. The lambda keyword is used to create small, single-expression functions without a formal name. These are often called "anonymous functions" or "lambda functions."
lam = ?
# Analytical form for OLS and Ridge solution: theta_Ridge = (X^T X + lambda * I)^{-1} X^T y and theta_OLS = (X^T X)^{-1} X^T y
I = np.eye(n_features)
theta_closed_formRidge = ?
theta_closed_formOLS = ?
print("Closed-form Ridge coefficients:", theta_closed_form)
print("Closed-form OLS coefficients:", theta_closed_form)
This computes the Ridge and OLS regression coefficients directly. The identity matrix \(I\) has the same size as \(X^T X\) and adds \(\lambda\) to the diagonal of \(X^T X\) for Ridge regression. We then invert this matrix and multiply by \(X^T y\). The result is a NumPy array of shape (n_features,) containing the fitted parameters \(\boldsymbol{\theta}\).
3a)#
Finalize, in the above code, the OLS and Ridge regression determination of the optimal parameters \(\boldsymbol{\theta}\).
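If you want something to compare against, a minimal sketch of one possible completion is given below. It assumes the standardized X_norm and centered y_centered from Exercise 1, and uses np.linalg.solve on the normal equations rather than forming the inverse explicitly, which is numerically more stable but mathematically equivalent.
lam = 1.0  # example value of the hyperparameter
n_features = X_norm.shape[1]
I = np.eye(n_features)
# Solve the normal equations instead of explicitly inverting X^T X
theta_closed_formOLS = np.linalg.solve(X_norm.T @ X_norm, X_norm.T @ y_centered)
theta_closed_formRidge = np.linalg.solve(X_norm.T @ X_norm + lam * I, X_norm.T @ y_centered)
print("Closed-form OLS coefficients:", theta_closed_formOLS)
print("Closed-form Ridge coefficients:", theta_closed_formRidge)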
3b)#
Explore the results as a function of different values of the hyperparameter \(\lambda\). See for example exercise 4 from week 36.
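One simple way to do this is to loop over a logarithmically spaced set of \(\lambda\) values, for example (the range below is just a suggestion):
lambdas = np.logspace(-4, 2, 7)
for lam in lambdas:
    theta_ridge = np.linalg.solve(X_norm.T @ X_norm + lam * np.eye(n_features), X_norm.T @ y_centered)
    print(f"lambda = {lam:.1e}, theta = {theta_ridge}")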
Exercise 4, Implementing the simplest form for gradient descent#
Alternatively, we can fit the Ridge regression model using gradient descent. This is useful for visualizing the iterative convergence and is necessary if \(n\) and \(p\) are so large that the closed-form solution becomes too slow or memory-intensive. We derive the gradients from the cost functions defined above. Use the gradients of the Ridge and OLS cost functions with respect to the parameters \(\boldsymbol{\theta}\) and set up (using the template below) your own gradient descent code for OLS and Ridge regression.
Below is a template code for a gradient descent implementation of OLS and Ridge regression:
# Gradient descent parameters, learning rate eta first
eta = 0.1
# Then number of iterations
num_iters = 1000
# Initialize weights for gradient descent
theta = np.zeros(n_features)
# Gradient descent loop
for t in range(num_iters):
# Compute gradients for OLS and Ridge
grad_OLS = ?
grad_Ridge = ?
# Update parameters theta
theta_gdOLS = ?
theta_gdRidge = ?
# After the loop, theta_gdOLS and theta_gdRidge contain the fitted coefficients
theta_gdOLS = ?
theta_gdRidge = ?
print("Gradient Descent OLS coefficients:", theta_gdOLS)
print("Gradient Descent Ridge coefficients:", theta_gdRidge)
4a)#
Write first a gradient descent code for OLS only using the above template. Discuss the results as a function of the learning rate parameter and the number of iterations.
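A minimal sketch of such a loop, assuming the gradient convention from Exercise 2 and the standardized data from Exercise 1, could look like this:
eta = 0.1
num_iters = 1000
n, n_features = X_norm.shape
theta_gdOLS = np.zeros(n_features)
for t in range(num_iters):
    # Gradient of the OLS cost (1/n)*||y - X theta||^2
    grad_OLS = -(2.0 / n) * X_norm.T @ (y_centered - X_norm @ theta_gdOLS)
    theta_gdOLS = theta_gdOLS - eta * grad_OLS
print("Gradient Descent OLS coefficients:", theta_gdOLS)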
4b)#
Write then a similar code for Ridge regression using the above template. Try to add a stopping criterion based on the number of iterations and the difference between the new and old \(\theta\) values. How would you define such a stopping criterion?
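A common choice is to stop when the parameter update becomes smaller than some tolerance. The sketch below uses the Euclidean norm of the change in \(\boldsymbol{\theta}\); monitoring the change in the cost function or the norm of the gradient would be equally valid criteria. The tolerance and \(\lambda\) values are just illustrative.
eta = 0.1
num_iters = 1000
tol = 1e-8  # tolerance for the stopping criterion
lam = 1.0   # example value of the hyperparameter
n, n_features = X_norm.shape
theta_gdRidge = np.zeros(n_features)
for t in range(num_iters):
    grad_Ridge = -(2.0 / n) * X_norm.T @ (y_centered - X_norm @ theta_gdRidge) + 2.0 * lam * theta_gdRidge
    theta_new = theta_gdRidge - eta * grad_Ridge
    # Stop when the update is smaller than the tolerance
    if np.linalg.norm(theta_new - theta_gdRidge) < tol:
        theta_gdRidge = theta_new
        break
    theta_gdRidge = theta_new
print("Gradient Descent Ridge coefficients:", theta_gdRidge)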
Exercise 5, Ridge regression and a new Synthetic Dataset#
We create a synthetic linear regression dataset with a sparse underlying relationship. This means we have many features but only a few of them actually contribute to the target. In our example, we’ll use 10 features with only 3 non-zero weights in the true model. This way, the target is generated as a linear combination of a few features (with known coefficients) plus some random noise. The steps we include are:
Decide on the number of samples and features (e.g. 100 samples, 10 features). Define the true coefficient vector with mostly zeros (for sparsity). For example, we set \(\hat{\boldsymbol{\theta}} = [5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]\), meaning only features 0, 1, and 6 have a real effect on y.
Then we sample feature values for \(\boldsymbol{X}\) randomly (e.g. from a normal distribution). We use a normal distribution so features are roughly centered around 0. Then we compute the target values \(y\) using the linear combination \(\boldsymbol{X}\hat{\boldsymbol{\theta}}\) and add some noise (to simulate measurement error or unexplained variance).
Below is the code to generate the dataset:
import numpy as np
# Set random seed for reproducibility
np.random.seed(0)
# Define dataset size
n_samples = 100
n_features = 10
# Define true coefficients (sparse linear relationship)
theta_true = np.array([5.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
# Generate feature matrix X (n_samples x n_features) with random values
X = np.random.randn(n_samples, n_features) # standard normal distribution
# Generate target values y with a linear combination of X and theta_true, plus noise
noise = 0.5 * np.random.randn(n_samples) # Gaussian noise
y = X @ theta_true + noise
This code produces a dataset where only features 0, 1, and 6 significantly influence \(\boldsymbol{y}\). The rest of the features have zero true coefficient. For example, feature 0 has a true weight of 5.0, feature 1 has -3.0, and feature 6 has 2.0, so the expected relationship is \(y \approx 5.0\,x_0 - 3.0\,x_1 + 2.0\,x_6 + \mathrm{noise}\).
You can remove the noise if you wish to.
Try to fit the above data set using OLS and Ridge regression with the analytical expressions and your own gradient descent codes.
If everything worked correctly, the learned coefficients should be close to the true values [5.0, -3.0, 0.0, …, 2.0, …] that we used to generate the data. Keep in mind that due to regularization and noise, the learned values will not exactly equal the true ones, but they should be in the same ballpark. Which method (OLS or Ridge) gives the best results?
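As a rough sketch of the comparison using the closed-form expressions from Exercise 3 (the same comparison can be done with your gradient descent code), one could standardize the synthetic data as in Exercise 1 and print the learned coefficients next to theta_true. Since the features are drawn from a standard normal distribution, the coefficients on standardized features are roughly comparable to the true ones. The variable names and the value of lam below are just illustrative.
# Standardize the synthetic features and center the target, as in Exercise 1
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()
lam = 0.1  # example value; try several
I = np.eye(n_features)
theta_OLS = np.linalg.solve(Xs.T @ Xs, Xs.T @ yc)
theta_Ridge = np.linalg.solve(Xs.T @ Xs + lam * I, Xs.T @ yc)
print("True coefficients:           ", theta_true)
print("OLS coefficients:            ", theta_OLS)
print("Ridge coefficients (lam=0.1):", theta_Ridge)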