Other ingredients of a neural network

Having defined the architecture of a neural network, we must optimize the cost function with respect to the parameters \( \boldsymbol{\Theta} \). This requires computing the gradient, that is, the derivatives of the cost function with respect to a high-dimensional set of parameters. The minimization itself is typically carried out with iterative gradient-based methods, including the following (the last three are sketched in code after the list):

  1. various quasi-Newton methods,
  2. plain gradient descent (GD) with a constant learning rate \( \eta \),
  3. GD with momentum, and
  4. stochastic gradient descent (SGD), combined with various schemes for adapting the learning rate.
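
To make items 2–4 concrete, here is a minimal NumPy sketch of the three update rules applied to an illustrative least-squares cost. The cost, the synthetic data, and all parameter values (`eta`, `gamma`, `batch_size`, the iteration counts) are assumptions chosen for demonstration only, not prescriptions from these notes.

```python
import numpy as np

# Illustrative least-squares cost C(theta) = (1/2n) ||X theta - y||^2
# on synthetic data; any differentiable cost would work the same way.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 0.1 * rng.normal(size=100)

def grad_cost(theta, X, y):
    """Gradient of the mean-squared-error cost with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

def gd(theta, eta=0.1, n_iter=1000):
    """Plain gradient descent with a constant learning rate eta."""
    for _ in range(n_iter):
        theta = theta - eta * grad_cost(theta, X, y)
    return theta

def gd_momentum(theta, eta=0.1, gamma=0.9, n_iter=1000):
    """GD with momentum: a velocity term accumulates past gradients."""
    v = np.zeros_like(theta)
    for _ in range(n_iter):
        v = gamma * v + eta * grad_cost(theta, X, y)
        theta = theta - v
    return theta

def sgd(theta, eta=0.1, batch_size=10, n_epochs=100):
    """Stochastic gradient descent: gradients estimated on random minibatches."""
    n = len(y)
    for _ in range(n_epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = perm[start:start + batch_size]
            theta = theta - eta * grad_cost(theta, X[batch], y[batch])
    return theta

theta0 = np.zeros(5)
print("GD:      ", gd(theta0.copy()))
print("Momentum:", gd_momentum(theta0.copy()))
print("SGD:     ", sgd(theta0.copy()))
```

All three variants use the same gradient; they differ only in how the step is formed. Momentum damps oscillations by averaging past gradients, while SGD replaces the full gradient with a cheap minibatch estimate, which is what makes training feasible on large data sets.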