The mean squared error and its derivative

We defined earlier a possible cost function using the mean squared error

$$ C(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)^T\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)\right\}, $$

or, using the design/feature matrix \( \boldsymbol{X} \), we have the more compact matrix-vector expression

$$ C(\boldsymbol{\beta})=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)\right\}. $$

We note that the design matrix \( \boldsymbol{X} \) does not depend on the unknown parameters defined by the vector \( \boldsymbol{\beta} \). We are now interested in minimizing the cost function with respect to the unknown parameters \( \boldsymbol{\beta} \).
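As a quick numerical check, the elementwise sum and the compact matrix-vector form give the same value. The sketch below uses randomly generated \( \boldsymbol{X} \), \( \boldsymbol{y} \), and \( \boldsymbol{\beta} \), which are illustrative placeholders and not data from the text.

```python
import numpy as np

# Illustrative placeholder data: n = 5 samples, p = 2 features
rng = np.random.default_rng(2024)
n, p = 5, 2
X = rng.normal(size=(n, p))   # design/feature matrix
beta = rng.normal(size=p)     # parameter vector
y = rng.normal(size=n)        # response vector

ytilde = X @ beta             # model prediction

# Elementwise definition of the cost function
C_sum = np.sum((y - ytilde) ** 2) / n

# Compact matrix-vector form (y - X beta)^T (y - X beta) / n
w = y - X @ beta
C_mat = (w.T @ w) / n

print(C_sum, C_mat)           # the two expressions agree
```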

The mean squared error is a scalar. Using the results from example three above, we can define a new vector

$$ \boldsymbol{w}=\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}, $$

which depends on \( \boldsymbol{\beta} \). We rewrite the cost function as

$$ C(\boldsymbol{\beta})=\frac{1}{n}\boldsymbol{w}^T\boldsymbol{w}, $$

with partial derivative

$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=\frac{2}{n}\boldsymbol{w}^T\frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}}, $$

From example two above we have

$$ \frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}}=-\boldsymbol{X}, $$

since each component \( w_i=y_i-\sum_{j}X_{ij}\beta_j \) is linear in \( \boldsymbol{\beta} \), so that \( \partial w_i/\partial \beta_j=-X_{ij} \). Inserting this expression we obtain

$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=-\frac{2}{n}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)^T\boldsymbol{X}, $$

or, taking the transpose, as

$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T}=-\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right). $$
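A standard way to check the analytic gradient is to compare it with a finite-difference approximation. The sketch below does this for randomly generated \( \boldsymbol{X} \), \( \boldsymbol{y} \), and \( \boldsymbol{\beta} \) (again illustrative placeholders, not data from the text), using the column-vector form of the derivative.

```python
import numpy as np

# Illustrative placeholder data
rng = np.random.default_rng(2024)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = rng.normal(size=n)

def cost(b):
    """Mean squared error C(beta) = (y - X b)^T (y - X b) / n."""
    w = y - X @ b
    return (w @ w) / n

# Analytic gradient in column-vector form: -2/n X^T (y - X beta)
grad_analytic = -2.0 / n * X.T @ (y - X @ beta)

# Central finite-difference approximation, component by component
eps = 1e-6
grad_numeric = np.zeros(p)
for j in range(p):
    e = np.zeros(p)
    e[j] = eps
    grad_numeric[j] = (cost(beta + e) - cost(beta - e)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # difference close to machine precision
```

Setting this gradient equal to zero gives the condition an optimal \( \boldsymbol{\beta} \) must satisfy, which is the starting point for the minimization we set out to do.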