Gradient Boosting, algorithm

Steepest descent is, however, not much used in this form, since it only optimizes \( f \) at a fixed set of \( n \) points and thus does not yield a function that generalizes to new inputs. We can instead modify the algorithm by fitting a weak learner to approximate the negative gradient signal.

Suppose we have a cost function \( C(f)=\sum_{i=0}^{n-1}L(y_i, f(x_i)) \), where \( y_i \) is our target and \( f(x_i) \) is the function meant to model \( y_i \). This cost function could be the standard squared-error function

$$ C(\boldsymbol{y},\boldsymbol{f})=\sum_{i=0}^{n-1}(y_i-f(x_i))^2. $$
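For this choice of cost function the negative gradient used in the algorithm below has a simple closed form: differentiating with respect to \( f(x_i) \) and flipping the sign gives

$$ u_{m,i} = -\frac{\partial C(\boldsymbol{y},\boldsymbol{f})}{\partial f(x_i)}\bigg|_{f=f_{m-1}} = 2\left(y_i-f_{m-1}(x_i)\right), $$

that is, each component of the negative gradient is simply (twice) the residual of the current model.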

We proceed in an iterative fashion as follows:

  1. Initialize our estimate \( f_0(x) \).
  2. For \( m=1:M \), we
    1. compute the negative gradient vector \( \boldsymbol{u}_m = -\partial C(\boldsymbol{y},\boldsymbol{f})/\partial \boldsymbol{f} \) evaluated at \( f(x) = f_{m-1}(x) \), that is, componentwise \( u_{m,i} = -\partial C/\partial f(x_i) \) at \( f(x_i)=f_{m-1}(x_i) \);
    2. fit the so-called base learner \( h_m(u_m,x) \) to the negative gradient \( \boldsymbol{u}_m \);
    3. update the estimate \( f_m(x) = f_{m-1}(x)+h_m(u_m,x) \);
  3. The final estimate is then \( f_M(x) = f_0(x) + \sum_{m=1}^M h_m(u_m,x) \); a minimal code sketch of these steps is given below.
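
To make the steps concrete, here is a minimal sketch in Python of this loop for the squared-error cost above, using shallow regression trees from scikit-learn as the base learners \( h_m \). The synthetic data, the tree depth and the number of iterations \( M \) are arbitrary illustrative choices, and the factor of two in the negative gradient is absorbed into the base learner; in practice one would also shrink each update by a learning rate, which is omitted here so the code follows the listed steps literally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative choice)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(200)

M = 20  # number of boosting iterations (arbitrary for the example)

# Step 1: initialize f_0(x), here as the constant mean of the targets
f0 = y.mean()
f_current = np.full_like(y, f0)
base_learners = []

for m in range(M):
    # Step 2a: negative gradient of the squared-error cost at f_{m-1};
    # up to a factor of two this is just the residual y_i - f_{m-1}(x_i)
    u_m = y - f_current
    # Step 2b: fit the base learner h_m to the negative gradient
    h_m = DecisionTreeRegressor(max_depth=2).fit(x, u_m)
    # Step 2c: update the estimate f_m(x) = f_{m-1}(x) + h_m(x)
    f_current = f_current + h_m.predict(x)
    base_learners.append(h_m)

# Step 3: the final estimate f_M(x) = f_0(x) + sum_m h_m(x)
def f_M(x_new):
    return f0 + sum(h.predict(x_new) for h in base_learners)

print("training MSE:", np.mean((y - f_M(x)) ** 2))
```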