Other ingredients of a neural network
Having defined the architecture of a neural network, the optimization
of the cost function with respect to the parameters \( \boldsymbol{\Theta} \)
requires the calculation of gradients and their use in an iterative
minimization scheme. The gradients are the derivatives of the cost function,
a multidimensional object, with respect to the parameters, and the
minimization is typically carried out with various gradient-based methods
(a short code sketch follows the list below), including
- various quasi-Newton methods,
- plain gradient descent (GD) with a constant learning rate \( \eta \),
- GD with momentum and other adaptive approximations to the learning rate \( \eta \), such as
  - Adaptive gradient (ADAgrad)
  - Root mean-square propagation (RMSprop)
  - Adaptive moment estimation (ADAM), which combines adaptive learning rates with momentum, and many others
- Stochastic gradient descent (SGD) and various families of learning rate schedules
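
To make the update rules behind some of these methods concrete, the following is a minimal sketch comparing plain GD with a constant learning rate, GD with momentum, and ADAM on a simple two-dimensional quadratic cost. The toy cost, the hyperparameter values, and the function names are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Toy quadratic cost C(theta) = 0.5 theta^T A theta - b^T theta,
# with gradient A theta - b (an illustrative choice, not from the text).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def gradient(theta):
    return A @ theta - b

def plain_gd(theta, eta=0.1, n_iter=200):
    # Plain gradient descent with a constant learning rate eta.
    for _ in range(n_iter):
        theta = theta - eta * gradient(theta)
    return theta

def gd_momentum(theta, eta=0.1, gamma=0.9, n_iter=200):
    # Gradient descent with momentum: accumulate a velocity that smooths the updates.
    v = np.zeros_like(theta)
    for _ in range(n_iter):
        v = gamma * v + eta * gradient(theta)
        theta = theta - v
    return theta

def adam(theta, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=1000):
    # ADAM: bias-corrected first and second moments give a per-parameter learning rate.
    m = np.zeros_like(theta)
    s = np.zeros_like(theta)
    for t in range(1, n_iter + 1):
        g = gradient(theta)
        m = beta1 * m + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        s_hat = s / (1 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta

theta0 = np.zeros(2)
# All three should end up close to the exact minimizer np.linalg.solve(A, b).
print(plain_gd(theta0), gd_momentum(theta0), adam(theta0))
```

In stochastic gradient descent the full gradient above would be replaced by an estimate computed on a random minibatch of the data, often combined with a decaying learning rate schedule.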