SGD example
As an example, suppose we have \( 10 \) data points \( (\mathbf{x}_1,\cdots, \mathbf{x}_{10}) \)
and we choose to have \( M=5 \) minibatches,
then each minibatch contains two data points. In particular we have
\( B_1 = (\mathbf{x}_1,\mathbf{x}_2), \cdots, B_5 =
(\mathbf{x}_9,\mathbf{x}_{10}) \). Note that if you choose \( M=1 \) you
have only a single batch containing all data points. At the other extreme,
you may choose \( M=n \), which gives one minibatch per data point, i.e.,
\( B_k = \mathbf{x}_k \).
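
The partition above can be sketched in a few lines of Python. This is only an illustration: the helper name `make_minibatches` is hypothetical, the batches are taken as contiguous blocks in the order of the data, and the indices are zero-based, so \( B_1 = (\mathbf{x}_1,\mathbf{x}_2) \) corresponds to the indices `[0, 1]`.

```python
import numpy as np

def make_minibatches(n, M):
    """Split the indices 0, ..., n-1 into M contiguous minibatches.

    Hypothetical helper for illustration only; in practice the data are
    usually shuffled before they are split.
    """
    return np.array_split(np.arange(n), M)

# Ten data points and M = 5 minibatches give two points per minibatch
batches = make_minibatches(10, 5)
for k, batch in enumerate(batches, start=1):
    print(f"B_{k} holds data point indices {batch.tolist()}")

# The two extremes: one batch with all points, or one batch per point
single_batch = make_minibatches(10, 1)
one_per_point = make_minibatches(10, 10)
```

Note that `np.array_split` also accepts values of \( M \) that do not divide \( n \) evenly, in which case the minibatches differ in size by at most one data point.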
The idea is now to approximate the gradient by replacing the sum over
all data points with a sum over the data points in one of the minibatches,
picked at random in each gradient descent step:
$$
\nabla_{\beta}
C(\mathbf{\beta}) = \sum_{i=1}^n \nabla_\beta c_i(\mathbf{x}_i,
\mathbf{\beta}) \rightarrow \sum_{i \in B_k} \nabla_\beta
c_i(\mathbf{x}_i, \mathbf{\beta}).
$$
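
To make the replacement concrete, here is a minimal sketch with NumPy. The text does not fix a particular cost function, so the sketch assumes a least-squares term \( c_i = (\mathbf{x}_i^T\mathbf{\beta} - y_i)^2 \) with randomly generated inputs and targets. The targets `y`, the helper `grad_sum`, and all numerical values are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: n = 10 data points, p = 3 parameters, M = 5 minibatches
n, p, M = 10, 3, 5
X = rng.normal(size=(n, p))   # row i plays the role of x_i
y = rng.normal(size=n)        # targets (an assumption, not part of the text)
beta = np.zeros(p)

def grad_sum(X, y, beta, idx):
    """Sum over i in idx of the gradient of c_i = (x_i^T beta - y_i)^2."""
    residual = X[idx] @ beta - y[idx]
    return 2.0 * X[idx].T @ residual

# Contiguous minibatches B_1, ..., B_M of index sets
batches = np.array_split(np.arange(n), M)

# Full gradient: the sum runs over all n data points
full_grad = grad_sum(X, y, beta, np.arange(n))

# Stochastic approximation: the sum runs over one randomly picked minibatch
k = rng.integers(M)
minibatch_grad = grad_sum(X, y, beta, batches[k])

print("full gradient     :", full_grad)
print("minibatch gradient:", minibatch_grad)
```

Since the minibatches partition the data, adding `grad_sum` over all \( M \) minibatches recovers the full gradient exactly. A single minibatch gradient is therefore a cheaper but noisier estimate, based on \( n/M \) data points instead of \( n \).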