We defined earlier a possible cost function using the mean squared error
$$ C(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)^T\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)\right\}, $$or, using the design/feature matrix \( \boldsymbol{X} \), we have the more compact matrix-vector form
$$ C(\boldsymbol{\beta})=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)\right\}. $$We note that the design matrix \( \boldsymbol{X} \) does not depend on the unknown parameters defined by the vector \( \boldsymbol{\beta} \). We are now interested in minimizing the cost function with respect to the unknown parameters \( \boldsymbol{\beta} \).
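The matrix-vector form of the cost function translates directly into code. A minimal sketch (the names `X`, `y`, and `beta` and the synthetic data are illustrative only, not taken from the text):

```python
import numpy as np

# Synthetic data for illustration: n samples, p features.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                  # design/feature matrix
beta_true = np.array([1.0, -2.0, 0.5])       # "true" parameters (assumed)
y = X @ beta_true + 0.1 * rng.normal(size=n) # noisy targets

def cost(beta, X, y):
    """Mean squared error C(beta) = (1/n) (y - X beta)^T (y - X beta)."""
    w = y - X @ beta                         # residual vector
    return (w @ w) / len(y)

print(cost(beta_true, X, y))   # small: near the noise level
print(cost(np.zeros(p), X, y)) # larger: far from the minimizer
```

Note that `X` itself never changes as we vary `beta`, mirroring the observation above that the design matrix does not depend on the parameters.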
The mean squared error is a scalar, and using the results from example three above we can define a new vector
$$ \boldsymbol{w}=\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}, $$which depends on \( \boldsymbol{\beta} \). We rewrite the cost function as
$$ C(\boldsymbol{\beta})=\frac{1}{n}\boldsymbol{w}^T\boldsymbol{w}, $$with partial derivative
$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=\frac{2}{n}\boldsymbol{w}^T\frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}}, $$and using that
$$ \frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}}=-\boldsymbol{X}, $$which follows from the result in example two above. Inserting the last expression we obtain
$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=-\frac{2}{n}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)^T\boldsymbol{X}, $$or, after transposing, as
$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T}=-\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right). $$
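A quick way to gain confidence in the analytic gradient \( -\frac{2}{n}\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}) \) is to compare it against a central finite-difference approximation of the cost function. A sketch (all names and data are illustrative assumptions):

```python
import numpy as np

# Illustrative data and an arbitrary evaluation point beta.
rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def cost(b):
    """C(b) = (1/n) (y - X b)^T (y - X b)."""
    w = y - X @ b
    return (w @ w) / n

# Analytic gradient from the derivation: -(2/n) X^T (y - X beta).
grad_analytic = -2.0 / n * X.T @ (y - X @ beta)

# Central finite differences along each coordinate direction.
eps = 1e-6
grad_numeric = np.array([
    (cost(beta + eps * e) - cost(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny discrepancy
```

The agreement of the two gradients (up to finite-difference error) confirms the sign and the factor \( 2/n \) in the expression above.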