The gradient descent method is sensitive to the choice of learning rate \( \gamma_k \). This is because we are only guaranteed that \( F(\mathbf{x}_{k+1}) \leq F(\mathbf{x}_k) \) for sufficiently small \( \gamma_k \). The difficulty lies in choosing a good learning rate. If it is too small, the method takes a long time to converge; if it is too large, the iterates can overshoot the minimum and oscillate or even diverge.
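
The effect of the learning rate can be seen on a toy problem. The sketch below is only an illustration and makes several assumptions: the objective \( F(x) = x^2 \), the starting point, the three constant values of \( \gamma \), and the helper name `gradient_descent` are all chosen here for demonstration.

```python
def gradient_descent(grad, x0, gamma, n_steps):
    """Run gradient descent with a constant learning rate; return all iterates."""
    xs = [x0]
    for _ in range(n_steps):
        # Update rule: x_{k+1} = x_k - gamma * grad F(x_k)
        xs.append(xs[-1] - gamma * grad(xs[-1]))
    return xs


# Toy objective (an assumption for illustration): F(x) = x^2, so F'(x) = 2x.
def F(x):
    return x * x


def grad_F(x):
    return 2.0 * x


# Three illustrative learning rates: too small, reasonable, too large.
for gamma in (0.01, 0.4, 1.1):
    xs = gradient_descent(grad_F, x0=1.0, gamma=gamma, n_steps=50)
    print(f"gamma = {gamma:<5} F(x_50) = {F(xs[-1]):.3e}")
```

For this particular \( F \), each step multiplies the iterate by \( 1 - 2\gamma \), so the iteration converges only for \( 0 < \gamma < 1 \). With \( \gamma = 0.01 \) the value of \( F \) decreases at every step but slowly, with \( \gamma = 0.4 \) it drops quickly, and with \( \gamma = 1.1 \) the iterates change sign at every step and grow in magnitude.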
Many of these shortcomings can be alleviated by introducing randomness. One such method is stochastic gradient descent (SGD), discussed below.