The equations for ordinary least squares

The data to which we want to apply a machine learning method consist of a set of inputs \( \boldsymbol{x}^T=[x_0,x_1,x_2,\dots,x_{n-1}] \) and the outputs we want to model \( \boldsymbol{y}^T=[y_0,y_1,y_2,\dots,y_{n-1}] \). We assume that the output data can be represented (for a regression case) by a continuous function \( f \) through

$$ y_i=f(x_i)+\epsilon_i, $$

or in general

$$ \boldsymbol{y}=f(\boldsymbol{x})+\boldsymbol{\epsilon}, $$

where \( \boldsymbol{\epsilon} \) represents noise, usually assumed to follow a normal distribution with zero mean and variance \( \sigma^2 \).
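As a small illustration of this data model, we can generate a synthetic data set \( y_i=f(x_i)+\epsilon_i \) with normally distributed noise. The choice \( f(x)=2+3x \), the sample size and the noise level are assumptions made here purely for illustration.

```python
import numpy as np

# Hypothetical example: data y_i = f(x_i) + eps_i with f(x) = 2 + 3x
# and Gaussian noise of zero mean and variance sigma^2.
rng = np.random.default_rng(2021)
n = 100
sigma = 0.5
x = np.linspace(0.0, 1.0, n)
# eps_i drawn from N(0, sigma^2)
eps = rng.normal(loc=0.0, scale=sigma, size=n)
y = 2.0 + 3.0 * x + eps
```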

In linear regression we approximate the unknown function with another continuous function \( \tilde{\boldsymbol{y}}(\boldsymbol{x}) \) which depends linearly on some unknown parameters \( \boldsymbol{\beta}^T=[\beta_0,\beta_1,\beta_2,\dots,\beta_{p-1}] \).

Last week we introduced the so-called design matrix in order to define the approximation \( \boldsymbol{\tilde{y}} \) via the unknown quantity \( \boldsymbol{\beta} \) as

$$ \boldsymbol{\tilde{y}}= \boldsymbol{X}\boldsymbol{\beta}, $$
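To make the matrix-vector product \( \boldsymbol{\tilde{y}}=\boldsymbol{X}\boldsymbol{\beta} \) concrete, here is a minimal sketch assuming a polynomial design matrix, with columns \( x^0, x^1, \dots, x^{p-1} \); the specific degree and parameter values are chosen only for illustration.

```python
import numpy as np

def design_matrix(x, p):
    """Polynomial design matrix with columns x^0, x^1, ..., x^(p-1)."""
    X = np.zeros((len(x), p))
    for j in range(p):
        X[:, j] = x**j
    return X

x = np.array([0.0, 0.5, 1.0])
X = design_matrix(x, 3)          # rows: [1, x_i, x_i^2]
beta = np.array([1.0, 2.0, 3.0]) # example parameters beta_0, beta_1, beta_2
y_tilde = X @ beta               # y_tilde = [1.0, 2.75, 6.0]
```

Note that \( \tilde{y} \) is linear in \( \boldsymbol{\beta} \) even though the columns of \( \boldsymbol{X} \) are nonlinear functions of \( x \).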

and in order to find the optimal parameters \( \beta_i \) we defined a function which measures the spread between the values \( y_i \) (the output values we want to reproduce) and the parametrized values \( \tilde{y}_i \): the so-called cost/loss function.
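The cost function for ordinary least squares can be sketched as the mean squared spread between \( \boldsymbol{y} \) and \( \boldsymbol{\tilde{y}}=\boldsymbol{X}\boldsymbol{\beta} \); the function name and the tiny example data below are assumptions for illustration only.

```python
import numpy as np

def cost_ols(y, X, beta):
    """Mean squared error between data y and model prediction X @ beta."""
    residual = y - X @ beta
    return (residual @ residual) / len(y)

# Tiny check: a straight line through the data gives zero cost.
x = np.array([0.0, 1.0, 2.0])
X = np.column_stack((np.ones(3), x))  # columns: intercept, x
beta = np.array([0.0, 1.0])           # model y = x
y = np.array([0.0, 1.0, 2.0])
cost = cost_ols(y, X, beta)           # cost = 0.0
```

Minimizing this function with respect to \( \boldsymbol{\beta} \) yields the optimal parameters.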