Almost every problem in machine learning and data science starts with a dataset X , a model g(\beta) , which is a function of the parameters \beta and a cost function C(X, g(\beta)) that allows us to judge how well the model g(\beta) explains the observations X . The model is fit by finding the values of \beta that minimize the cost function. Ideally we would be able to solve for \beta analytically, however this is not possible in general and we must use some approximative/numerical method to compute the minimum.