Data Analysis and Machine Learning Lectures: Optimization and Gradient Methods

Computation of gradients

This in turn means that the gradient can be computed as a sum over the gradients of the individual terms $c_i$: $\nabla_\beta C(\mathbf{\beta}) = \sum_{i=1}^n \nabla_\beta c_i(\mathbf{x}_i, \mathbf{\beta}).$
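
As a small numerical check of this decomposition, the sketch below assumes a squared-error term $c_i = (\mathbf{x}_i^T\mathbf{\beta} - y_i)^2$. That choice of $c_i$, the targets `y`, and the function names are illustrative assumptions, not something fixed by the text above. It sums the per-sample gradients and compares the result with the closed-form gradient of the full cost.

```python
import numpy as np

def per_sample_gradient(x_i, y_i, beta):
    """Gradient of the assumed term c_i = (x_i . beta - y_i)^2 w.r.t. beta."""
    return 2.0 * (x_i @ beta - y_i) * x_i

def full_gradient(X, y, beta):
    """Sum the per-sample gradients over all n data points."""
    return sum(per_sample_gradient(X[i], y[i], beta) for i in range(X.shape[0]))

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # n = 8 data points, 3 parameters
y = rng.normal(size=8)
beta = rng.normal(size=3)

grad_sum = full_gradient(X, y, beta)
# Closed-form gradient of C(beta) = sum_i (x_i . beta - y_i)^2
grad_closed_form = 2.0 * X.T @ (X @ beta - y)

print(np.allclose(grad_sum, grad_closed_form))
```

The explicit loop is there only to mirror the sum in the formula; in practice the vectorized matrix expression would normally be used.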

Stochasticity is introduced by computing the gradient on only a subset of the data, called a minibatch. If there are $n$ data points and each minibatch has size $M$, there will be $n/M$ minibatches (assuming $M$ divides $n$). We denote these minibatches by $B_k$, where $k=1,\dots,n/M$.
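
The partition into minibatches can be sketched as follows. This is a minimal illustration under stated assumptions: the data are shuffled at random before being split, $M$ divides $n$, and the per-sample term is the squared error $(\mathbf{x}_i^T\mathbf{\beta} - y_i)^2$. The helper names `make_minibatches` and `minibatch_gradient` are hypothetical.

```python
import numpy as np

def make_minibatches(n, M, rng):
    """Shuffle the indices 0..n-1 and split them into n/M minibatches of size M."""
    if n % M != 0:
        raise ValueError("this sketch assumes that M divides n")
    return rng.permutation(n).reshape(n // M, M)

def minibatch_gradient(X, y, beta, batch):
    """Gradient of the assumed squared-error cost, summed over one minibatch B_k."""
    residuals = X[batch] @ beta - y[batch]
    return 2.0 * X[batch].T @ residuals

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n, M = 12, 4
X = rng.normal(size=(n, 3))
y = rng.normal(size=n)
beta = np.zeros(3)

batches = make_minibatches(n, M, rng)   # row k holds the indices in B_k
k = rng.integers(len(batches))          # pick one minibatch at random
grad_k = minibatch_gradient(X, y, beta, batches[k])

print(batches.shape)   # (n/M, M)
```

Because the minibatches partition the data, summing `minibatch_gradient` over all rows of `batches` recovers the full gradient; using a single randomly chosen $B_k$ gives a cheaper, noisy estimate of it.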