Optimizing the cost function

The network is trained by finding the weights and biases that minimize the cost function. One of the most widely used classes of methods is gradient descent and its generalizations. The idea behind gradient descent is to adjust the parameters in the direction of the negative gradient, which is the direction in which the cost function decreases most steeply. For a sufficiently small step size, this moves us toward a local minimum of the cost function. The parameters \( \theta \) (all weights and biases) are iteratively adjusted according to the rule

$$ \theta_{i+1} = \theta_i - \eta \nabla \mathcal{C}(\theta_i) ,$$

where \( \eta \) is known as the learning rate, which controls how big a step we take towards the minimum. This update can be repeated for any number of iterations, or until we are satisfied with the result.
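
As an illustration of the update rule above, here is a minimal sketch of plain gradient descent in Python with NumPy. It is not a neural network: to keep the example short, it assumes a linear model with a mean squared error cost, and the synthetic data, the function names `cost_gradient` and `gradient_descent`, and the values of `eta` and `n_iterations` are all chosen for illustration only.

```python
import numpy as np

def cost_gradient(theta, X, y):
    """Gradient of the assumed cost C(theta) = (1/N) * ||X theta - y||^2."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ theta - y)

def gradient_descent(X, y, eta=0.1, n_iterations=1000):
    """Repeat theta <- theta - eta * grad C(theta) a fixed number of times."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        theta = theta - eta * cost_gradient(theta, X, y)
    return theta

# Synthetic data (an assumption for illustration): y = 4 + 3x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
y = 4.0 + 3.0 * x + 0.1 * rng.normal(size=100)

theta_gd = gradient_descent(X, y)
print(theta_gd)  # estimated [intercept, slope]
```

For this convex cost the iteration should approach the least-squares solution as long as `eta` is small enough. Too large a learning rate makes the iteration diverge, while too small a value makes convergence slow.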

A simple and effective improvement is a variant called minibatch gradient descent, often referred to as stochastic gradient descent (SGD). Instead of calculating the gradient on the whole dataset, we calculate an approximation of the gradient on a subset of the data called a minibatch. If there are \( N \) data points and we have a minibatch size of \( M \), the total number of batches is \( N/M \) (assuming \( M \) divides \( N \)). We denote each minibatch \( B_k \), with \( k = 1, 2, \dots, N/M \). The gradient is then approximated as

$$ \nabla \mathcal{C}(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla \mathcal{L}_i(\theta) \quad \rightarrow \quad \frac{1}{M} \sum_{i \in B_k} \nabla \mathcal{L}_i(\theta) ,$$

i.e., instead of averaging the gradient of the loss over the entire dataset, we average it over a single minibatch.

This has two important benefits:

  1. Introducing stochasticity decreases the chance that the algorithm becomes stuck in a local minimum.
  2. It significantly speeds up the calculation, since we do not have to use the entire dataset to calculate the gradient.
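
The minibatch update described above can be sketched as follows. This is a minimal illustration and not a definitive implementation: it assumes the same kind of linear model with a squared-error loss as a stand-in for the network, and the names `minibatch_gradient` and `minibatch_sgd`, the synthetic data, and the values of `eta`, `batch_size` and `n_epochs` are hypothetical choices.

```python
import numpy as np

def minibatch_gradient(theta, X_batch, y_batch):
    """Average gradient of the assumed squared-error loss over one minibatch B_k."""
    m = X_batch.shape[0]
    return (2.0 / m) * X_batch.T @ (X_batch @ theta - y_batch)

def minibatch_sgd(X, y, eta=0.05, batch_size=10, n_epochs=200, seed=0):
    """Gradient descent where each step uses only M = batch_size data points."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Reshuffle once per epoch so the minibatches differ between epochs
        indices = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            theta = theta - eta * minibatch_gradient(theta, X[batch], y[batch])
    return theta

# Synthetic data (an assumption for illustration): y = 4 + 3x plus small noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
y = 4.0 + 3.0 * x + 0.1 * rng.normal(size=100)

theta_sgd = minibatch_sgd(X, y)  # N = 100, M = 10, so N/M = 10 minibatches per epoch
print(theta_sgd)  # estimated [intercept, slope]
```

With a constant learning rate the parameters do not settle exactly at the minimum but fluctuate around it, because each minibatch gradient is only a noisy estimate of the full gradient. Each update, on the other hand, touches only `batch_size` data points rather than all \( N \).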

The various optimization methods, with codes and algorithms, are discussed in our lectures on gradient descent approaches.