The data to which we want to apply a machine learning method consist of a set of inputs \boldsymbol{x}^T=[x_0,x_1,x_2,\dots,x_{n-1}] and the outputs we want to model \boldsymbol{y}^T=[y_0,y_1,y_2,\dots,y_{n-1}] . We assume that the output data can be represented (for a regression case) by a continuous function f through
y_i=f(x_i)+\epsilon_i,
or in general
\boldsymbol{y}=f(\boldsymbol{x})+\boldsymbol{\epsilon},
where \boldsymbol{\epsilon} represents the noise, which is commonly assumed to follow a normal probability distribution with zero mean and variance \sigma^2 .
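To make the model concrete, here is a minimal sketch in Python that generates synthetic data of this form; the underlying function f (here f(x)=2+3x), the noise level, and all parameter values are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n = 100                           # number of data points
sigma = 0.5                       # noise standard deviation
x = np.linspace(0.0, 1.0, n)      # inputs x_0, ..., x_{n-1}
eps = rng.normal(0.0, sigma, n)   # eps_i drawn from N(0, sigma^2)
y = 2.0 + 3.0 * x + eps           # outputs y_i = f(x_i) + eps_i
```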
In linear regression we approximate the unknown function f with another continuous function \tilde{\boldsymbol{y}}(\boldsymbol{x}) which depends linearly on some unknown parameters \boldsymbol{\beta}^T=[\beta_0,\beta_1,\beta_2,\dots,\beta_{p-1}] .
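For instance, if we choose a polynomial basis (one common choice among many), the approximation reads
\tilde{y}(x_i)=\beta_0+\beta_1 x_i+\beta_2 x_i^2+\dots+\beta_{p-1}x_i^{p-1},
which is linear in the parameters \beta_j even though it is nonlinear in the input x_i .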
Last week we introduced the so-called design matrix in order to define the approximation \boldsymbol{\tilde{y}} via the unknown quantity \boldsymbol{\beta} as
\boldsymbol{\tilde{y}}= \boldsymbol{X}\boldsymbol{\beta},
and in order to find the optimal parameters \beta_i we defined a function which measures the spread between the values y_i (the output values we want to reproduce) and the parametrized values \tilde{y}_i , namely the so-called cost/loss function.
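As an illustration of these definitions, the sketch below builds a design matrix and evaluates a cost function on the synthetic data from above; both the polynomial basis and the mean squared error are assumptions here, chosen because they are standard in this setting.

```python
import numpy as np

def design_matrix(x, p):
    """n x p design matrix with columns 1, x, x^2, ..., x^{p-1}."""
    return np.vander(x, N=p, increasing=True)

def cost(X, y, beta):
    """Mean squared error between the data y and the model X @ beta."""
    residual = y - X @ beta
    return np.mean(residual**2)

# Example usage on synthetic data (hypothetical values):
x = np.linspace(0.0, 1.0, 100)
y = 2.0 + 3.0 * x + np.random.default_rng(42).normal(0.0, 0.5, 100)
X = design_matrix(x, p=2)                 # straight-line model beta_0 + beta_1 x
print(cost(X, y, np.array([2.0, 3.0])))   # spread around the true parameters
```

Minimizing this cost with respect to \boldsymbol{\beta} is what singles out the optimal parameter values.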