To think about, first part

When you compare your own code with a library such as Scikit-Learn, there are some technicalities to keep in mind. The examples here demonstrate some of these aspects and the pitfalls they can lead to.

The discussion here focuses on the role of the intercept, how we can set up the design matrix, what scaling we should use, and other topics which tend to confuse us.

The intercept can be interpreted as the expected value of our target/output variable when all predictors are set to zero. Thus, if we cannot assume that the expected output is zero when all predictors (the columns of the design matrix) are zero, it may be a bad idea to implement a model which penalizes the intercept. Furthermore, in for example Ridge and Lasso regression (to be discussed in more detail next week), the default solutions from Scikit-Learn for the unknown parameters \( \boldsymbol{\beta} \) (when not shrinking \( \beta_0 \)) are derived under the assumption that both \( \boldsymbol{y} \) and \( \boldsymbol{X} \) are zero centered, that is, we subtract the mean values.
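As a minimal sketch of this point, the code below centers \( \boldsymbol{X} \) and \( \boldsymbol{y} \), solves the Ridge equations without an intercept column, and then recovers the intercept as \( \beta_0 = \bar{y} - \bar{\boldsymbol{x}}^T\boldsymbol{\beta} \). The synthetic data, the seed, the second-order polynomial and the value of the penalty `lam` are all assumptions chosen for illustration only. The result is compared with Scikit-Learn's `Ridge` using `fit_intercept=True`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration): a noisy second-order polynomial
rng = np.random.default_rng(2021)
n = 100
x = rng.uniform(0.0, 1.0, size=n)
y = 2.0 + 3.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=n)

# Design matrix WITHOUT a column of ones; the intercept is handled separately
X = np.column_stack([x, x**2])
lam = 0.1  # Ridge penalty, arbitrary value for this sketch

# Center the predictors and the targets by subtracting the mean values
X_mean = X.mean(axis=0)
y_mean = y.mean()
Xc = X - X_mean
yc = y - y_mean

# Own Ridge solution on centered data: (Xc^T Xc + lam I) beta = Xc^T yc
p = X.shape[1]
beta_own = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# The intercept is not penalized; it is recovered from the mean values
intercept_own = y_mean - X_mean @ beta_own

# Scikit-Learn's Ridge with an unpenalized intercept
ridge = Ridge(alpha=lam, fit_intercept=True)
ridge.fit(X, y)

print("Own beta:         ", beta_own)
print("Scikit-Learn beta:", ridge.coef_)
print("Own intercept:         ", intercept_own)
print("Scikit-Learn intercept:", ridge.intercept_)
```

If we instead include a column of ones in the design matrix and apply the penalty to all columns, the intercept is shrunk as well, and the parameters will in general differ from those of Scikit-Learn.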