Making a tree

To implement recursive binary splitting, we start by selecting a predictor \( x_j \) and a cutpoint \( s \) that split the predictor space into the two regions \( R_1(j,s) \) and \( R_2(j,s) \),

$$ R_1(j,s) = \left\{X\vert x_j < s\right\} $$

and

$$ R_2(j,s) = \left\{X\vert x_j \geq s\right\}, $$

so that we obtain the lowest MSE, that is

$$ \sum_{i:\, x_i\in R_1(j,s)}(y_i-\overline{y}_{R_1})^2+\sum_{i:\, x_i\in R_2(j,s)}(y_i-\overline{y}_{R_2})^2, $$

which we want to minimize by considering all predictors \( x_1,x_2,\dots,x_p \) and, for each predictor, all possible values of the cutpoint \( s \). These candidate values can be chosen at random, or more systematically as the midpoints between consecutive observed values of the predictor, proceeding until we find an optimal value.

For any \( j \) and \( s \), the pair of half-planes \( R_1(j,s) \) and \( R_2(j,s) \) defined above partitions the observations. Here \( \overline{y}_{R_1} \) is the mean response for the training observations in \( R_1(j,s) \), and \( \overline{y}_{R_2} \) is the mean response for the training observations in \( R_2(j,s) \).

Finding the values of \( j \) and \( s \) that minimize the above equation can be done quite quickly, especially when the number of features \( p \) is not too large.
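To make the search concrete, here is a minimal sketch in Python of this exhaustive search, using only NumPy. The function name `best_split` and its interface are our own choices for illustration, not taken from any particular library; the candidate cutpoints are taken as midpoints between consecutive observed values, as discussed above.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the pair (j, s) minimizing the summed
    squared error over the two resulting regions.

    X : (n, p) array of predictors, y : (n,) array of responses.
    Returns (best_j, best_s, best_sse). A sketch, not an optimized routine.
    """
    n, p = X.shape
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(p):
        values = np.unique(X[:, j])
        # Candidate cutpoints: midpoints between consecutive observed values
        cutpoints = 0.5 * (values[:-1] + values[1:])
        for s in cutpoints:
            left = X[:, j] < s           # region R_1(j, s)
            right = ~left                # region R_2(j, s)
            sse = (np.sum((y[left] - y[left].mean()) ** 2)
                   + np.sum((y[right] - y[right].mean()) ** 2))
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse
```

Because the means \( \overline{y}_{R_1} \) and \( \overline{y}_{R_2} \) can be updated incrementally as the cutpoint sweeps through the sorted values, the search can be made faster than this double loop suggests, which is why it scales well when \( p \) is not too large.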

Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the MSE within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the MSE. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
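One common way to realize this greedy procedure is a top-down recursion: each call finds the best split of its region and then recurses on the two resulting subregions. The sketch below reuses the hypothetical `best_split` from above and stops splitting once a region contains at most `min_samples` observations (five by default, matching the stopping criterion mentioned above).

```python
import numpy as np  # best_split is assumed from the sketch above

def grow_tree(X, y, min_samples=5):
    """Recursively split regions until no region contains more than
    `min_samples` observations (or no further split is possible).
    Returns a nested dict representing the tree; a sketch only."""
    if len(y) <= min_samples:
        return {"prediction": y.mean()}   # leaf: mean response in the region
    j, s, _ = best_split(X, y)
    if j is None:                         # no valid cutpoint found
        return {"prediction": y.mean()}
    left = X[:, j] < s
    return {
        "feature": j,
        "cutpoint": s,
        "left": grow_tree(X[left], y[left], min_samples),
        "right": grow_tree(X[~left], y[~left], min_samples),
    }

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + 0.1 * rng.normal(size=100)
tree = grow_tree(X, y, min_samples=5)
```

Note that the recursion splits every region it creates, which is equivalent to the step-by-step description above: at each stage exactly one existing region is replaced by two new ones, and the process terminates when all regions satisfy the stopping criterion.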