The Squared-Error again! Steepest Descent

We start again with our cost function {\cal C}(\boldsymbol{y},\boldsymbol{f})=\sum_{i=0}^{n-1}{\cal L}(y_i, f(x_i)), which we want to minimize. This means that for every iteration we need to optimize

\hat{\boldsymbol{f}} = \mathrm{argmin}_{\boldsymbol{f}}\hspace{0.1cm} \sum_{i=0}^{n-1}(y_i-f(x_i))^2.

We introduce a real function h_m(x) that builds up our final function f_M(x) as

f_M(x) = \sum_{m=0}^M h_m(x).

In the steepest descent approach we approximate h_m(x) = -\rho_m g_m(x), where \rho_m is a scalar and g_m(x) is the gradient defined as

g_m(x_i) = \left[ \frac{\partial {\cal L}(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i)=f_{m-1}(x_i)}.

With the new gradient we can update f_m(x) = f_{m-1}(x) - \rho_m g_m(x). Using the above squared-error function we see that the gradient is g_m(x_i) = -2(y_i-f(x_i)).
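
To make these steps concrete, here is a minimal numerical sketch (not part of the original text; the names y, f, rho and n_steps are illustrative) of steepest descent on the squared-error cost, with the function represented by its values at the training points: at every iteration the gradient g_m(x_i) = -2(y_i-f_{m-1}(x_i)) is computed and the current fit is moved a step of length \rho against it.

import numpy as np

# Illustrative sketch: steepest descent on C(y, f) = sum_i (y_i - f(x_i))^2,
# with f represented by its values at the training points.
rng = np.random.default_rng(seed=0)
y = rng.normal(size=10)        # training targets y_i
f = np.zeros_like(y)           # starting guess f_0(x_i) = 0 for all i
rho = 0.1                      # fixed step length
n_steps = 50

for m in range(1, n_steps + 1):
    g = -2.0 * (y - f)         # gradient g_m(x_i) = -2 (y_i - f_{m-1}(x_i))
    f = f - rho * g            # update f_m(x_i) = f_{m-1}(x_i) - rho g_m(x_i)

print(np.sum((y - f)**2))      # the cost decreases towards zero with m

With this fixed \rho the residual y_i-f_m(x_i) shrinks by a factor 1-2\rho at every iteration, so the cost tends to zero as m grows.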

Choosing f_0(x)=0 we obtain g_1(x_i) = -2y_i, and inserting this into the minimization problem for the cost function we have

\rho_1 = \mathrm{argmin}_{\rho}\hspace{0.1cm} \sum_{i=0}^{n-1}(y_i-2\rho y_i)^2.
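
For completeness, carrying out this minimization explicitly: setting the derivative with respect to \rho equal to zero,

\frac{d}{d\rho}\sum_{i=0}^{n-1}(y_i-2\rho y_i)^2 = -4(1-2\rho)\sum_{i=0}^{n-1}y_i^2 = 0,

gives \rho_1 = 1/2, and therefore f_1(x_i) = f_0(x_i)-\rho_1 g_1(x_i) = y_i.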