Loading [MathJax]/extensions/TeX/boldsymbol.js

 

 

 

The CART algorithm for Classification

For classification, the CART algorithm splits the data set in two subsets using a single feature k and a threshold t_k . This could be for example a threshold set by a number below a certain circumference of a malign tumor.

How do we find these two quantities? We search for the pair (k,t_k) that produces the purest subset using for example the gini factor G . The cost function it tries to minimize is then

C(k,t_k) = \frac{m_{\mathrm{left}}}{m}G_{\mathrm{left}}+ \frac{m_{\mathrm{right}}}{m}G_{\mathrm{right}},

where G_{\mathrm{left/right}} measures the impurity of the left/right subset and m_{\mathrm{left/right}} is the number of instances in the left/right subset

Once it has successfully split the training set in two, it splits the subsets using the same logic, then the subsubsets and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the max\_depth hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters control additional stopping conditions such as the min\_samples\_split , min\_samples\_leaf , min\_weight\_fraction\_leaf , and max\_leaf\_nodes .