Deriving the Lasso Regression Equations

Using the matrix-vector expression for Lasso regression, we have the following cost function

$$ C(\boldsymbol{X},\boldsymbol{\theta})=\frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})\right\}+\lambda\vert\vert\boldsymbol{\theta}\vert\vert_1. $$
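
As a small numerical illustration, this cost function translates directly into a few lines of NumPy. The sketch below is a minimal version assuming a design matrix `X`, a response vector `y`, a parameter vector `theta`, and a penalty strength `lam` (names chosen here purely for illustration).

```python
import numpy as np

def lasso_cost(X, y, theta, lam):
    """Lasso cost: mean squared error plus the L1 penalty on theta."""
    n = len(y)
    residual = y - X @ theta
    return (residual @ residual) / n + lam * np.sum(np.abs(theta))
```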

Taking the derivative with respect to \( \boldsymbol{\theta} \) and recalling that the derivative of the absolute value is (we drop the boldfaced vector symbol for simplicity)

$$ \frac{d \vert \theta\vert}{d \theta}=\mathrm{sgn}(\theta)=\left\{\begin{array}{cc} 1 & \theta > 0 \\-1 & \theta < 0, \end{array}\right. $$

we have that the derivative of the cost function is (at \( \theta = 0 \), where the absolute value is not differentiable, \( \mathrm{sgn}(\theta) \) is understood as a subgradient taking some value in \( [-1,1] \))

$$ \frac{\partial C(\boldsymbol{X},\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=-\frac{2}{n}\boldsymbol{X}^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta})+\lambda\,\mathrm{sgn}(\boldsymbol{\theta})=0, $$

and reordering we have

$$ \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\theta}+\lambda\,\mathrm{sgn}(\boldsymbol{\theta})=\boldsymbol{X}^T\boldsymbol{y}. $$
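
In code, the gradient derived above is equally straightforward to evaluate. The sketch below is one possible implementation, using the same illustrative names as before; note that `np.sign` returns zero at the origin, which is one valid choice of subgradient there.

```python
import numpy as np

def lasso_subgradient(X, y, theta, lam):
    """Subgradient of the Lasso cost; np.sign(0) = 0 is one valid subgradient choice."""
    n = len(y)
    return -2.0 / n * X.T @ (y - X @ theta) + lam * np.sign(theta)
```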

This equation does not lead to a nice closed-form expression for the optimal parameters, as in Ridge regression or ordinary least squares, since \( \mathrm{sgn}(\boldsymbol{\theta}) \) depends on the unknown parameters themselves. Note that we have absorbed the factor \( 2/n \) in a redefinition of the parameter \( \lambda \). We will solve this type of problem numerically, using libraries like scikit-learn and our own gradient descent code, in project 1.
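
As a preview, the sketch below fits Lasso on synthetic data, once with a plain subgradient descent loop and once with scikit-learn; the data, learning rate, and number of iterations are invented here for illustration. Note that scikit-learn's `Lasso` minimizes \( \frac{1}{2n}\vert\vert\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\vert\vert_2^2+\alpha\vert\vert\boldsymbol{\theta}\vert\vert_1 \), so its `alpha` corresponds to \( \lambda/2 \) in the notation used here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_subgradient(X, y, theta, lam):
    """Subgradient of the Lasso cost (repeated from the sketch above)."""
    n = len(y)
    return -2.0 / n * X.T @ (y - X @ theta) + lam * np.sign(theta)

# Synthetic data with a sparse true parameter vector
rng = np.random.default_rng(2021)
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([2.0, 0.0, -1.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)

lam = 0.1

# Our own (sub)gradient descent with a fixed learning rate
theta = np.zeros(p)
eta = 0.01
for _ in range(10_000):
    theta -= eta * lasso_subgradient(X, y, theta, lam)

# scikit-learn's solver (no intercept, to match the derivation above)
model = Lasso(alpha=lam / 2, fit_intercept=False)
model.fit(X, y)

print("subgradient descent:", theta)
print("scikit-learn:       ", model.coef_)
```

The two sets of parameters should agree closely, with the coefficient belonging to the zero entry of the true parameter vector shrunk towards (or exactly to) zero.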