Gradient descent example

Let \( \mathbf{y} = (y_1,\cdots,y_n)^T \), \( \mathbf{\hat{y}} = (\hat{y}_1,\cdots,\hat{y}_n)^T \), and \( \beta = (\beta_0, \beta_1)^T \), where \( n = 100 \) is the number of data points.

It is convenient to write \( \mathbf{\hat{y}} = X\beta \), where \( X \in \mathbb{R}^{100 \times 2} \) is the design matrix given by $$ X \equiv \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{100} \\ \end{bmatrix}. $$ The loss function is given by $$ C(\beta) = ||X\beta-\mathbf{y}||^2 = ||X\beta||^2 - 2 \mathbf{y}^T X\beta + ||\mathbf{y}||^2 = \sum_{i=1}^{100} \left[ (\beta_0 + \beta_1 x_i)^2 - 2 y_i (\beta_0 + \beta_1 x_i) + y_i^2 \right] $$ and we want to find \( \beta \) such that \( C(\beta) \) is minimized.
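Since \( \nabla C(\beta) = 2X^T(X\beta - \mathbf{y}) \), gradient descent repeatedly updates \( \beta \leftarrow \beta - \eta \, \nabla C(\beta) \) for some learning rate \( \eta \). Here is a minimal sketch in Python with NumPy; the synthetic data, the learning rate \( \eta = 10^{-4} \), and the iteration count are illustrative choices, not values from the text.

```python
import numpy as np

# Hypothetical data: 100 noisy points along a line (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)

# Design matrix X: a column of ones (for beta_0) next to the x values.
X = np.column_stack([np.ones_like(x), x])

# Gradient of C(beta) = ||X beta - y||^2 is 2 X^T (X beta - y).
def grad(beta):
    return 2.0 * X.T @ (X @ beta - y)

beta = np.zeros(2)   # initial guess (beta_0, beta_1)
eta = 1e-4           # learning rate (hypothetical choice)
for _ in range(10_000):
    beta -= eta * grad(beta)

print(beta)  # approaches the least-squares solution, roughly (2, 3) here
```

The learning rate must be small enough that the update contracts; a step size larger than roughly the reciprocal of the largest eigenvalue of \( X^T X \) would make the iterates diverge.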