SGD example
As an example, suppose we have
10 data points
(\mathbf{x}_1,\cdots, \mathbf{x}_{10})
and we choose to have
M=5 minibatches,
then each minibatch contains two data points. In particular we have
B_1 = (\mathbf{x}_1,\mathbf{x}_2), \cdots, B_5 =
(\mathbf{x}_9,\mathbf{x}_{10}) . Note that if you choose
M=1 you
have only a single batch containing all data points, while at the other extreme
you may choose
M=n, resulting in one minibatch for each data point, i.e.
B_k = \mathbf{x}_k .
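The partition described above can be sketched in a few lines of Python. This is only an illustration: the helper name make_minibatches and the string labels standing in for the data points are assumptions made here, and the sketch assumes that M divides n evenly and that the data points are assigned to minibatches in order.

```python
def make_minibatches(data, M):
    """Split `data` into M consecutive minibatches of equal size.

    Hypothetical helper for illustration; assumes M divides len(data).
    """
    n = len(data)
    if n % M != 0:
        raise ValueError("this sketch assumes M divides n evenly")
    size = n // M
    return [data[k * size:(k + 1) * size] for k in range(M)]


# Labels standing in for the data points x_1, ..., x_10
data = [f"x{i}" for i in range(1, 11)]

batches = make_minibatches(data, M=5)     # five minibatches of two points each
single = make_minibatches(data, M=1)      # M=1: one batch with all points
per_point = make_minibatches(data, M=10)  # M=n: one point per minibatch
```

Here batches[0] plays the role of B_1 and batches[4] the role of B_5. In practice the data are often shuffled before being split, which this sketch leaves out.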
The idea is now to approximate the gradient by replacing the sum over
all data points with a sum over the data points in one of the minibatches,
picked at random in each gradient descent step:
\nabla_{\beta} C(\mathbf{\beta}) = \sum_{i=1}^n \nabla_\beta c_i(\mathbf{x}_i, \mathbf{\beta}) \rightarrow \sum_{i \in B_k} \nabla_\beta c_i(\mathbf{x}_i, \mathbf{\beta}).
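As a rough sketch of this replacement, the Python code below uses a minibatch gradient in place of the full gradient in each descent step. Several things here are assumptions made only for illustration and are not taken from the text: the scalar parameter beta, the per-point cost c_i = (beta x_i - y_i)^2, the synthetic targets y_i = 3 x_i, the learning rate eta and the function names.

```python
import random


def grad_ci(x, y, beta):
    # Gradient with respect to beta of the assumed cost c_i = (beta*x - y)**2
    return 2.0 * x * (beta * x - y)


def full_gradient(xs, ys, beta):
    # Sum over all n data points
    return sum(grad_ci(x, y, beta) for x, y in zip(xs, ys))


def minibatch_gradient(xs, ys, beta, batch):
    # Sum over the data points whose indices are in one minibatch B_k
    return sum(grad_ci(xs[i], ys[i], beta) for i in batch)


n, M = 10, 5
xs = [float(i) for i in range(1, n + 1)]  # synthetic inputs x_1, ..., x_10
ys = [3.0 * x for x in xs]                # synthetic targets, so the optimum is beta = 3

# Minibatches stored as lists of (zero-based) indices: [0, 1], [2, 3], ...
size = n // M
batches = [list(range(k * size, (k + 1) * size)) for k in range(M)]

rng = random.Random(0)  # fixed seed so the run is reproducible
beta = 0.0
eta = 0.001             # assumed learning rate, small enough for this data

for step in range(200):
    batch = rng.choice(batches)  # pick one minibatch at random
    beta -= eta * minibatch_gradient(xs, ys, beta, batch)
```

Summing the minibatch gradients over all M minibatches recovers the full gradient, so a single minibatch gradient is a cheaper, noisier stand-in for it. For this noise-free synthetic data, beta ends up close to 3.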