The tuning parameter \( \alpha \) controls a trade-off between the subtree’s complexity and its fit to the training data. When \( \alpha = 0 \), the subtree \( T \) will simply equal \( T_0 \), because then the above equation just measures the training error. However, as \( \alpha \) increases, there is a price to pay for having a tree with many terminal nodes, and so the above equation will tend to be minimized for a smaller subtree.
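For reference, the criterion in question is the standard cost-complexity objective (restated here on the assumption that it matches the preceding equation), where \( |T| \) denotes the number of terminal nodes of \( T \), \( R_m \) the region corresponding to the \( m \)th terminal node, and \( \hat{y}_{R_m} \) the mean of the training responses in \( R_m \):
\[
\sum_{m=1}^{|T|} \; \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|.
\]
At \( \alpha = 0 \) only the first term matters, so the unpruned tree \( T_0 \) minimizes it; as \( \alpha \) grows, the penalty \( \alpha |T| \) charged per terminal node increasingly favors smaller subtrees.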
It turns out that as we increase \( \alpha \) from zero, branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of \( \alpha \) is easy. We can select a value of \( \alpha \) using a validation set or using cross-validation. We then return to the full data set and obtain the subtree corresponding to the chosen \( \alpha \).
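As a concrete sketch of this procedure, scikit-learn implements cost-complexity pruning, and its cost_complexity_pruning_path method recovers exactly this nested sequence of subtrees, each identified by an effective \( \alpha \). The library choice and the synthetic data below are illustrative assumptions, not part of the text:

```python
# A minimal sketch: recover the nested sequence of subtrees, select
# alpha by cross-validation, then refit on the full data set.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the full data set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Grow the full tree T_0 internally and compute the pruning path:
# one effective alpha per subtree in the nested sequence.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate over that sequence of alphas; refit=True (the default)
# then returns to the full data set with the chosen alpha.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_tree = search.best_estimator_
print(f"chosen alpha: {search.best_params_['ccp_alpha']:.4g}, "
      f"terminal nodes: {best_tree.get_n_leaves()}")
```

Because the subtrees are nested, only the finite set of effective \( \alpha \) values returned by the pruning path needs to be searched, rather than a continuous grid.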