We can now write a program that minimizes \( C(\beta) \) using the gradient descent method with a constant learning rate \( \gamma \) according to
$$ \beta_{k+1} = \beta_k - \gamma \nabla_\beta C(\beta_k), \quad k=0,1,\cdots $$

We use the expression we computed for the gradient, choose \( \beta_0 \) randomly, and set \( \gamma = 0.001 \). We stop iterating when \( ||\nabla_\beta C(\beta_k)|| \leq \epsilon = 10^{-8} \). Note that the code below does not include the latter stop criterion.
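As an illustration, here is a minimal sketch of such a program (separate from the code referenced above), assuming the cost function is the mean squared error \( C(\beta) = \frac{1}{n}||\mathbf{y} - X\beta||^2 \), so that \( \nabla_\beta C(\beta) = \frac{2}{n} X^T(X\beta - \mathbf{y}) \). The data, seed, and dimensions below are illustrative placeholders, and the stop criterion is included for completeness:

```python
import numpy as np

# Illustrative data: a simple linear model y = 2 + 3x + noise (placeholder choices)
np.random.seed(2021)
n = 100
x = np.random.rand(n, 1)
y = 2.0 + 3.0 * x + 0.1 * np.random.randn(n, 1)
X = np.c_[np.ones((n, 1)), x]            # design matrix with an intercept column

gamma = 0.001                            # constant learning rate
epsilon = 1e-8                           # tolerance on the gradient norm
beta = np.random.randn(2, 1)             # random starting point beta_0

for k in range(1_000_000):               # safety cap on the number of iterations
    # Gradient of the mean squared error cost (assumed form of C)
    gradient = (2.0 / n) * X.T @ (X @ beta - y)
    if np.linalg.norm(gradient) <= epsilon:   # stop when ||grad C|| <= epsilon
        break
    beta -= gamma * gradient             # gradient descent update

print("Gradient descent beta:", beta.ravel())
```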
Finally, we can compare our solution for \( \beta \) with the analytic result given by \( \beta = (X^TX)^{-1} X^T \mathbf{y} \).
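A short sketch of this comparison, reusing `X`, `y`, and the gradient-descent `beta` from the illustrative example above (these names come from that sketch, not from the original code):

```python
# Analytic least-squares solution from the normal equations,
# beta = (X^T X)^{-1} X^T y, using X, y, and beta defined in the sketch above.
beta_analytic = np.linalg.inv(X.T @ X) @ X.T @ y
print("Analytic beta:        ", beta_analytic.ravel())
print("Absolute difference:  ", np.abs(beta - beta_analytic).ravel())
```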