Week 34: Introduction to the course, Logistics and Practicalities

Examples

To understand the relation between the predictors (also called features or properties), whose number we denote by $ p $, the number of data points $ n $, and the target (outcome, output, etc.) $ \boldsymbol{y} $, consider the model we discussed for describing nuclear binding energies.

There we assumed that we could parametrize the data using an expansion in powers of the mass number $ A $, motivated by the liquid drop model. Assuming

$$ BE(A) = a_0+a_1A+a_2A^{2/3}+a_3A^{-1/3}+a_4A^{-1}, $$

we have five predictors: the intercept, the term linear in $ A $, and the $ A^{2/3} $, $ A^{-1/3} $ and $ A^{-1} $ terms. Labeling them with the index $ j=0,1,2,3,4 $ gives $ p=5 $. Furthermore, we have $ n $ entries for each predictor, one per data point. Our design matrix $ \boldsymbol{X} $ is therefore an $ n\times p $ matrix, with one row per data point and one column per predictor.
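As a minimal sketch of how such a design matrix could be set up with NumPy, consider the snippet below. The function name `design_matrix` and the mass numbers in `A` are illustrative assumptions, not part of the course material or any measured data set. The layout assumed here has one row per data point and one column per predictor.

```python
import numpy as np


def design_matrix(A):
    """Build the design matrix for BE(A) = a_0 + a_1 A + a_2 A^(2/3) + a_3 A^(-1/3) + a_4 A^(-1).

    Hypothetical helper: rows are data points, columns are predictors.
    """
    A = np.asarray(A, dtype=float)
    return np.column_stack([
        np.ones_like(A),      # intercept column, multiplies a_0
        A,                    # multiplies a_1
        A ** (2.0 / 3.0),     # multiplies a_2
        A ** (-1.0 / 3.0),    # multiplies a_3
        A ** (-1.0),          # multiplies a_4
    ])


# Illustrative mass numbers only, chosen to show the shapes involved
A = np.array([4, 16, 40, 56, 120, 208])
X = design_matrix(A)
print(X.shape)  # (6, 5): six data points, five predictors
```

With the parameters collected in a vector $ \boldsymbol{a} = (a_0, a_1, a_2, a_3, a_4)^T $, the model values for all data points are then given by the matrix-vector product $ \boldsymbol{X}\boldsymbol{a} $.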

Here the predictors are based on a model we have made. A data set widely encountered in ML applications is the so-called credit card default data from Taiwan. It contains data on $ n=30000 $ credit card holders, with predictors like gender, marital status, age, profession, education, etc. In total there are $ 24 $ such predictors or attributes, leading to a design matrix of dimensionality $ 30000 \times 24 $. This is, however, a classification problem, and we will come back to it when we discuss Logistic Regression.