The assumption we have made here (it will be useful when we discuss the bias-variance tradeoff) is that there exists a function $f(x)$ and a normally distributed error $\varepsilon\sim N(0,\sigma^2)$ which together describe our data
$$
y = f(x) + \varepsilon.
$$
We approximate this function with our model from the solution of the linear regression equations; that is, our function $f$ is approximated by $\tilde{y}$, where we want to minimize $(y-\tilde{y})^2$, our MSE, with
$$
\tilde{y} = X\beta.
$$
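
To make the setup concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the procedure used elsewhere in these notes: the quadratic `f`, the noise level `sigma`, the sample size `n`, the seed, and the polynomial design matrix are all hypothetical choices. It draws data from $y = f(x) + \varepsilon$, solves the least-squares problem for $\beta$, and evaluates the MSE of $\tilde{y} = X\beta$.

```python
import numpy as np

rng = np.random.default_rng(2021)  # arbitrary seed, only for reproducibility


def f(x):
    # Hypothetical "true" function, chosen purely for illustration.
    return 2.0 + 3.0 * x - 1.5 * x**2


n = 200       # number of data points (assumed)
sigma = 0.5   # standard deviation of the noise (assumed)

x = rng.uniform(-1.0, 1.0, n)
eps = rng.normal(0.0, sigma, n)   # eps ~ N(0, sigma^2)
y = f(x) + eps                    # data model: y = f(x) + eps

# Design matrix with columns 1, x, x^2 (a second-order polynomial model).
X = np.column_stack([np.ones(n), x, x**2])

# Ordinary least squares: beta minimizes the sum of (y - X beta)^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_tilde = X @ beta                   # model prediction, y_tilde = X beta
mse = np.mean((y - y_tilde) ** 2)    # mean squared error on the same data

print("beta =", beta)
print("MSE  =", mse)
```

Because the model here contains the true function, the MSE on the fitted data should come out close to the noise variance `sigma**2`. With a model that cannot represent $f$, the MSE would also pick up a contribution from the mismatch between $f$ and $\tilde{y}$.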