Almost every problem in machine learning and data science starts with a dataset \( X \), a model \( g(\beta) \), which is a function of the parameters \( \beta \) and a cost function \( C(X, g(\beta)) \) that allows us to judge how well the model \( g(\beta) \) explains the observations \( X \). The model is fit by finding the values of \( \beta \) that minimize the cost function. Ideally we would be able to solve for \( \beta \) analytically, however this is not possible in general and we must use some approximative/numerical method to compute the minimum.