For each value of \( \alpha \) there corresponds a subtree \( T \in T_0 \) such that
$$ \sum_{m=1}^{\overline{T}}\sum_{i:x_i\in R_m}(y_i-\overline{y}_{R_m})^2+\alpha\overline{T}, $$is as small as possible. Here \( \overline{T} \) is the number of terminal nodes of the tree \( T \) , \( R_m \) is the rectangle (i.e. the subset of predictor space) corresponding to the \( m \)-th terminal node.
The tuning parameter \( \alpha \) controls a trade-off between the subtree’s complexity and its fit to the training data. When \( \alpha = 0 \), then the subtree \( T \) will simply equal \( T_0 \), because then the above equation just measures the training error. However, as \( \alpha \) increases, there is a price to pay for having a tree with many terminal nodes. The above equation will tend to be minimized for a smaller subtree.
It turns out that as we increase \( \alpha \) from zero branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of \( \alpha \) is easy. We can select a value of \( \alpha \) using a validation set or using cross-validation. We then return to the full data set and obtain the subtree corresponding to \( \alpha \).