Data Analysis and Machine Learning Lectures: Optimization and Gradient Methods

Loading [MathJax]/extensions/TeX/boldsymbol.js

The ideal

Ideally the sequence $\{\mathbf{x}_k \}_{k=0}$ converges to a global minimum of the function $F$ . In general we do not know if we are in a global or local minimum. In the special case when $F$ is a convex function, all local minima are also global minima, so in this case gradient descent can converge to the global solution. The advantage of this scheme is that it is conceptually simple and straightforward to implement. However the method in this form has some severe limitations:

In machine learing we are often faced with non-convex high dimensional cost functions with many local minima. Since GD is deterministic we will get stuck in a local minimum, if the method converges, unless we have a very good intial guess. This also implies that the scheme is sensitive to the chosen initial condition.

Note that the gradient is a function of $\mathbf{x} = (x_1,\cdots,x_n)$ which makes it expensive to compute numerically.