Gradient descent example

Let \( \mathbf{y} = (y_1,\cdots,y_n)^T \), \( \mathbf{\hat{y}} = (\hat{y}_1,\cdots,\hat{y}_n)^T \), and \( \beta = (\beta_0, \beta_1)^T \), where \( n = 100 \) is the number of data points.

It is convenient to write \( \mathbf{\hat{y}} = X\beta \), where \( X \in \mathbb{R}^{100 \times 2} \) is the design matrix given by $$ X \equiv \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{100} \\ \end{bmatrix}. $$ The loss function is given by $$ C(\beta) = ||X\beta-\mathbf{y}||^2 = ||X\beta||^2 - 2 \mathbf{y}^T X\beta + ||\mathbf{y}||^2 = \sum_{i=1}^{100} \left[ (\beta_0 + \beta_1 x_i)^2 - 2 y_i (\beta_0 + \beta_1 x_i) + y_i^2 \right] $$ and we want to find \( \beta \) such that \( C(\beta) \) is minimized.
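Since \( \nabla C(\beta) = 2X^T(X\beta - \mathbf{y}) \), gradient descent repeatedly updates \( \beta \leftarrow \beta - \eta \, \nabla C(\beta) \) for some learning rate \( \eta \). Here is a minimal sketch in Python with NumPy; the synthetic data, the learning rate \( \eta = 10^{-4} \), and the iteration count are illustrative choices, not values from the text.

```python
import numpy as np

# Hypothetical data: 100 noisy points along a line (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)

# Design matrix X: a column of ones (for beta_0) next to the x values.
X = np.column_stack([np.ones_like(x), x])

# Gradient of C(beta) = ||X beta - y||^2 is 2 X^T (X beta - y).
def grad(beta):
    return 2.0 * X.T @ (X @ beta - y)

beta = np.zeros(2)   # initial guess (beta_0, beta_1)
eta = 1e-4           # learning rate (hypothetical choice)
for _ in range(10_000):
    beta -= eta * grad(beta)

print(beta)  # approaches the least-squares solution, roughly (2, 3) here
```

The learning rate must be small enough that the update contracts; a step size larger than roughly the reciprocal of the largest eigenvalue of \( X^T X \) would make the iterates diverge.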