Stochastic Gradient Descent

Stochastic gradient descent (SGD) and its variants address some of the shortcomings of the gradient descent method discussed above.

The underlying idea of SGD comes from the observation that the cost function, which we want to minimize, can almost always be written as a sum over \( n \) data points \( \{\mathbf{x}_i\}_{i=1}^n \),

$$ C(\mathbf{\beta}) = \sum_{i=1}^n c_i(\mathbf{x}_i, \mathbf{\beta}). $$
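As a concrete illustration of this additive structure, the sketch below assumes a least-squares cost, so that each term \( c_i \) is the squared residual of a single data point, and performs plain SGD updates using the gradient of one randomly chosen term per step. The data, the helper names `c_i` and `grad_c_i`, and the learning rate are illustrative choices, not part of the text above.

```python
import numpy as np

# Minimal sketch: C(beta) = sum_i c_i(x_i, beta) for a least-squares fit,
# updated with SGD using one randomly chosen term per step.

rng = np.random.default_rng(0)

# Synthetic data: y = X @ beta_true + noise (assumed for illustration)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

def c_i(i, beta):
    """Contribution of data point i to the total cost (squared residual)."""
    return 0.5 * (y[i] - X[i] @ beta) ** 2

def grad_c_i(i, beta):
    """Gradient of c_i with respect to beta."""
    return -(y[i] - X[i] @ beta) * X[i]

beta = np.zeros(p)
eta = 0.01        # learning rate (illustrative value)
n_epochs = 50

for epoch in range(n_epochs):
    # Visit the data points in random order, one gradient term per update
    for i in rng.permutation(n):
        beta -= eta * grad_c_i(i, beta)

print("estimated beta:", beta)
print("total cost C(beta):", sum(c_i(i, beta) for i in range(n)))
```

In this sketch each update uses only one term of the sum, which is exactly what makes the method "stochastic": the single-term gradient is a noisy but cheap estimate of the full gradient of \( C(\mathbf{\beta}) \).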