The CART algorithm for Classification

For classification, the CART algorithm splits the data set in two subsets using a single feature \( k \) and a threshold \( t_k \). This could be for example a threshold set by a number below a certain circumference of a malign tumor.

How do we find these two quantities? We search for the pair \( (k,t_k) \) that produces the purest subset using for example the gini factor \( G \). The cost function it tries to minimize is then

$$ C(k,t_k) = \frac{m_{\mathrm{left}}}{m}G_{\mathrm{left}}+ \frac{m_{\mathrm{right}}}{m}G_{\mathrm{right}}, $$

where \( G_{\mathrm{left/right}} \) measures the impurity of the left/right subset and \( m_{\mathrm{left/right}} \) is the number of instances in the left/right subset

Once it has successfully split the training set in two, it splits the subsets using the same logic, then the subsubsets and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the \( max\_depth \) hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters control additional stopping conditions such as the \( min\_samples\_split \), \( min\_samples\_leaf \), \( min\_weight\_fraction\_leaf \), and \( max\_leaf\_nodes \).