Almost every problem in machine learning and data science starts with a dataset X , a model g(\theta) , which is a function of the parameters \theta and a cost function C(X, g(\theta)) that allows us to judge how well the model g(\theta) explains the observations X . The model is fit by finding the values of \theta that minimize the cost function. Ideally we would be able to solve for \theta analytically, however this is not possible in general and we must use some approximative/numerical method to compute the minimum.