Steepest descent as such is not much used, since it only optimizes \( f \) at a fixed set of \( n \) points and therefore does not yield a function that can generalize to new inputs. We can, however, modify the algorithm by fitting a weak learner to approximate the negative gradient signal at each step.
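To make this concrete, here is a minimal sketch of the idea in Python, assuming the squared-error loss defined below, for which the negative gradient at each training point is (up to a factor of 2) simply the residual \( y_i - f(x_i) \). The choice of weak learner (a depth-one regression tree), the learning rate, the number of iterations, and the toy data are illustrative assumptions, not prescribed by the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative assumption)
rng = np.random.default_rng(2021)
x = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(4 * np.pi * x).ravel() + 0.1 * rng.standard_normal(100)

n_steps = 100        # number of boosting iterations (assumption)
learning_rate = 0.1  # shrinkage factor (assumption)

f = np.zeros_like(y)  # start from the zero function f_0(x) = 0
learners = []
for m in range(n_steps):
    # For squared-error loss, the negative gradient w.r.t. f(x_i)
    # is proportional to the residual y_i - f(x_i)
    residuals = y - f
    # Fit a weak learner (a "stump") to the negative gradient signal
    tree = DecisionTreeRegressor(max_depth=1)
    tree.fit(x, residuals)
    # Update the additive model with a shrunken step
    f += learning_rate * tree.predict(x)
    learners.append(tree)

print(f"Final training cost C(y, f) = {np.sum((y - f) ** 2):.4f}")
```

The stored learners define the final model: a prediction at a new point is the sum of `learning_rate * tree.predict(...)` over all fitted trees, which is how the method generalizes beyond the \( n \) training points.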
Suppose we have a cost function \( C(f)=\sum_{i=0}^{n-1}L(y_i, f(x_i)) \), where \( y_i \) is our target and \( f(x_i) \) is the value of the function meant to model \( y_i \). This cost function could, for example, be the standard squared-error function
$$ C(\boldsymbol{y},\boldsymbol{f})=\sum_{i=0}^{n-1}(y_i-f(x_i))^2. $$

The way we proceed in an iterative fashion is to