The mean squared error and its derivative

We defined earlier a possible cost function using the mean squared error

$$ C(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i-\tilde{y}_i\right)^2=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)^T\left(\boldsymbol{y}-\boldsymbol{\tilde{y}}\right)\right\}, $$

or, using the design/feature matrix \( \boldsymbol{X} \), we have the more compact matrix-vector expression

$$ C(\boldsymbol{\beta})=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)\right\}. $$

We note that the design matrix \( \boldsymbol{X} \) does not depend on the unknown parameters defined by the vector \( \boldsymbol{\beta} \). We are now interested in minimizing the cost function with respect to the unknown parameters \( \boldsymbol{\beta} \).
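As a quick numerical check, the elementwise sum and the compact matrix-vector form give the same value. The sketch below uses randomly generated \( \boldsymbol{X} \), \( \boldsymbol{y} \), and \( \boldsymbol{\beta} \), which are illustrative placeholders and not data from the text.

```python
import numpy as np

# Illustrative placeholder data: n = 5 samples, p = 2 features
rng = np.random.default_rng(2024)
n, p = 5, 2
X = rng.normal(size=(n, p))   # design/feature matrix
beta = rng.normal(size=p)     # parameter vector
y = rng.normal(size=n)        # response vector

ytilde = X @ beta             # model prediction

# Elementwise definition of the cost function
C_sum = np.sum((y - ytilde) ** 2) / n

# Compact matrix-vector form (y - X beta)^T (y - X beta) / n
w = y - X @ beta
C_mat = (w.T @ w) / n

print(C_sum, C_mat)           # the two expressions agree
```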

The mean squared error is a scalar. Using the results from example three above, we can define a new vector

$$ \boldsymbol{w}=\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}, $$

which depends on \( \boldsymbol{\beta} \). We rewrite the cost function as

$$ C(\boldsymbol{\beta})=\frac{1}{n}\boldsymbol{w}^T\boldsymbol{w}, $$

with partial derivative

$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=\frac{2}{n}\boldsymbol{w}^T\frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}}, $$

From example two above we have

$$ \frac{\partial \boldsymbol{w}}{\partial \boldsymbol{\beta}}=-\boldsymbol{X}, $$

since each component \( w_i=y_i-\sum_{j}X_{ij}\beta_j \) is linear in \( \boldsymbol{\beta} \), so that \( \partial w_i/\partial \beta_j=-X_{ij} \). Inserting this expression we obtain

$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=-\frac{2}{n}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right)^T\boldsymbol{X}, $$

or, taking the transpose, as

$$ \frac{\partial C(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}^T}=-\frac{2}{n}\boldsymbol{X}^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\right). $$
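A standard way to check the analytic gradient is to compare it with a finite-difference approximation. The sketch below does this for randomly generated \( \boldsymbol{X} \), \( \boldsymbol{y} \), and \( \boldsymbol{\beta} \) (again illustrative placeholders, not data from the text), using the column-vector form of the derivative.

```python
import numpy as np

# Illustrative placeholder data
rng = np.random.default_rng(2024)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = rng.normal(size=n)

def cost(b):
    """Mean squared error C(beta) = (y - X b)^T (y - X b) / n."""
    w = y - X @ b
    return (w @ w) / n

# Analytic gradient in column-vector form: -2/n X^T (y - X beta)
grad_analytic = -2.0 / n * X.T @ (y - X @ beta)

# Central finite-difference approximation, component by component
eps = 1e-6
grad_numeric = np.zeros(p)
for j in range(p):
    e = np.zeros(p)
    e[j] = eps
    grad_numeric[j] = (cost(beta + e) - cost(beta - e)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # difference close to machine precision
```

Setting this gradient equal to zero gives the condition an optimal \( \boldsymbol{\beta} \) must satisfy, which is the starting point for the minimization we set out to do.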