We can now write a program that minimizes \( C(\beta) \) using the gradient descent method with a constant learning rate \( \gamma \) according to $$ \beta_{k+1} = \beta_k - \gamma \nabla_\beta C(\beta_k), \ k=0,1,\cdots $$
We can use the expression we computed for the gradient, let \( \beta_0 \) be chosen randomly, and set \( \gamma = 0.001 \). Stop iterating when \( ||\nabla_\beta C(\beta_k) || \leq \epsilon = 10^{-8} \).
Finally, we can compare our solution for \( \beta \) with the analytic result given by \( \beta = (X^TX)^{-1} X^T \mathbf{y} \).
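For concreteness, if the cost function from the earlier part is the mean squared error (an assumption here; adjust the constant if your cost is normalized differently), \( C(\beta) = \frac{1}{n} ||\mathbf{y} - X\beta||^2 \) with \( n \) data points, then the gradient entering the update rule is $$ \nabla_\beta C(\beta) = -\frac{2}{n} X^T (\mathbf{y} - X\beta). $$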
import numpy as np
"""
The following setup is just a suggestion, feel free to write it the way you like.
"""
#Setup problem described in the exercise
N = 100 #Nr of datapoints
M = 2 #Nr of features
x = np.random.rand(N) #Uniformly generated x-values in [0,1]
y = 5*x**2 + 0.1*np.random.randn(N)
X = np.c_[np.ones(N),x] #Construct design matrix
#Compute beta according to normal equations to compare with GD solution
Xt_X_inv = np.linalg.inv(X.T @ X)
Xt_y = X.T @ y
beta_NE = Xt_X_inv @ Xt_y
print(beta_NE)
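Below is a minimal sketch of the gradient descent loop itself, under the same mean-squared-error assumption for \( C(\beta) \); the names beta_GD, eps and max_iter are illustrative choices, not prescribed by the exercise.
#Gradient descent with constant learning rate, assuming C(beta) = (1/N)*||y - X beta||^2
gamma = 0.001       #Constant learning rate
eps = 1e-8          #Stopping tolerance on the gradient norm
max_iter = 1000000  #Safety cap on the number of iterations (illustrative choice)
beta_GD = np.random.randn(M)  #Random initial guess beta_0
for k in range(max_iter):
    gradient = -2.0/N * X.T @ (y - X @ beta_GD)  #Gradient of the assumed MSE cost
    if np.linalg.norm(gradient) <= eps:          #Stop when ||grad C|| <= epsilon
        break
    beta_GD = beta_GD - gamma*gradient           #Update step beta_{k+1} = beta_k - gamma*grad
print(beta_GD)  #Should be close to beta_NE from the normal equations
With \( \gamma = 0.001 \) convergence is slow but steady; for this quadratic cost any step size below \( 2/\lambda_{\max} \), with \( \lambda_{\max} \) the largest eigenvalue of \( \frac{2}{n} X^T X \), still converges, so \( \gamma \) can be increased if the loop needs too many iterations.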