The data to which we want to apply a machine learning method consist of a set of inputs \boldsymbol{x}^T=[x_0,x_1,x_2,\dots,x_{n-1}] and the outputs we want to model \boldsymbol{y}^T=[y_0,y_1,y_2,\dots,y_{n-1}] . We assume that the output data can be represented (for a regression case) by a continuous function f through
y_i=f(x_i)+\epsilon_i,
or in general
\boldsymbol{y}=f(\boldsymbol{x})+\boldsymbol{\epsilon},
where \boldsymbol{\epsilon} represents the noise, which is commonly assumed to follow a normal probability distribution with zero mean and variance \sigma^2 .
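To make the model concrete, here is a minimal sketch in Python that generates synthetic data of this form; the underlying function f (here f(x)=2+3x), the noise level, and all parameter values are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n = 100                           # number of data points
sigma = 0.5                       # noise standard deviation
x = np.linspace(0.0, 1.0, n)      # inputs x_0, ..., x_{n-1}
eps = rng.normal(0.0, sigma, n)   # eps_i drawn from N(0, sigma^2)
y = 2.0 + 3.0 * x + eps           # outputs y_i = f(x_i) + eps_i
```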
In linear regression we approximate the unknown function f with another continuous function \tilde{\boldsymbol{y}}(\boldsymbol{x}) which depends linearly on some unknown parameters \boldsymbol{\beta}^T=[\beta_0,\beta_1,\beta_2,\dots,\beta_{p-1}] .
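For instance, if we choose a polynomial basis (one common choice among many), the approximation reads
\tilde{y}(x_i)=\beta_0+\beta_1 x_i+\beta_2 x_i^2+\dots+\beta_{p-1}x_i^{p-1},
which is linear in the parameters \beta_j even though it is nonlinear in the input x_i .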
Last week we introduced the so-called design matrix in order to define the approximation \boldsymbol{\tilde{y}} via the unknown quantity \boldsymbol{\beta} as
\boldsymbol{\tilde{y}}= \boldsymbol{X}\boldsymbol{\beta},
and in order to find the optimal parameters \beta_i we defined a function which measures the spread between the values y_i (the output values we want to reproduce) and the parametrized values \tilde{y}_i , namely the so-called cost/loss function.
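As an illustration of these definitions, the sketch below builds a design matrix and evaluates a cost function on the synthetic data from above; both the polynomial basis and the mean squared error are assumptions here, chosen because they are standard in this setting.

```python
import numpy as np

def design_matrix(x, p):
    """n x p design matrix with columns 1, x, x^2, ..., x^{p-1}."""
    return np.vander(x, N=p, increasing=True)

def cost(X, y, beta):
    """Mean squared error between the data y and the model X @ beta."""
    residual = y - X @ beta
    return np.mean(residual**2)

# Example usage on synthetic data (hypothetical values):
x = np.linspace(0.0, 1.0, 100)
y = 2.0 + 3.0 * x + np.random.default_rng(42).normal(0.0, 0.5, 100)
X = design_matrix(x, p=2)                 # straight-line model beta_0 + beta_1 x
print(cost(X, y, np.array([2.0, 3.0])))   # spread around the true parameters
```

Minimizing this cost with respect to \boldsymbol{\beta} is what singles out the optimal parameter values.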