Classification Trees

How classification Tree creates Rectangle in Predictor Space

Criterion for Choosing the Splitting Predictor

When deciding which predictor to split on, we need to consider different types of predictors: continuous, binary, and categorical. In this post, we will focus on continuous and binary predictors.

For categorical predictors, one approach is to use the “one vs rest” strategy. We evaluate each category separately and observe which one reduces the randomness the most when split.

In the case of a continuous predictor, we first sort the values and then select the midpoint as the split point.

To keep our decision tree simple, it’s ideal to have a small tree size. Therefore, at each step, we should choose the split that leads to the purest child nodes.

Two commonly used criteria for measuring impurity are Gini and Entropy. It’s important to note that these criteria are not directly used to select the predictor to split on. Instead, we calculate the difference in impurity before and after splitting.

When computing Gini or entropy after splitting, we apply weights to both splits to account for their proportions.

Multi-class

Decision tree can naturally handle multi class problem as entropy and gini can be calculated for multi-class.

Entropy

  • Formula:
    • H = ∑ -p * log(p)
  • About :
    • Randomness
    • Information Carries
    • Highest when p=0.5
    • Higher Entropy => Higher Randomness => Carries more information
    • After splitting we want entropy to reduce
  • Initial :
    • n0 positive and m0 negative sample
    • g0 = (n+m) total samples
  • After Splitting
    • Group 1:
      • n1 positive
      • m1 negative
      • g1 = (n1 + m1) total
    • Group 2:
      • n2 positive
      • m2 negative
      • g2 = (n2 + m2) total
  • Before Entropy :
    • H_before =  -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
  • After Entropy :
    • H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
    • H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
    • H_after = (g1/g0) * H1 + (g2/g0)*H2
  • diff = H_before – H_after
  • Select a predictor for which diff is highest and split on it
    • Which means we are selecting a predictor which reduces randomness more
    • Which also mean we are selecting a predictor which reduces information more
      • So we are selecting a variable which carries more information
        • More important feature

Gini

  • Gini for 2 class :
    • G = p1 * (1-p1) + p2 * (1 – p2)
    • Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
  • Gini in general:
    • G = ∑ p * (1 – p) = 1 – ∑ p²
  • Like entropy gini is maximum when p1 = p2 = 0.5
    • In this case nodes are least pure
  • Gini is a measure of impurity which we want to reduce by splitting on predictor
  • diff = G_before – G_after
    • We will select a predictor for which diff is highest

Entropy vs Gini

  • Given a choice gini will be better as computing log is costlier.
  • Both of them are better than “classification error as metric” [1]
    • Which would be selecting majority class for calculating classification error
    • It is useful while pruning but not while growing the tree

ROC Curve

  • ROC curve can be constructed easily for classifier which outputs ranking.
  • On the contrary decision tree outputs label
  • However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
  • Then we can change the threshold and plot ROC.

Regularization

Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

  1. limit max. depth of trees
  2. Cost complexity pruning
  3. ensembles / bag more than just 1 tree
  4. set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)

Reference

[0] : Applied predictive modeling by Max Kuhn and Kjell Johnson

[1] : https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

3 thoughts on “Classification Trees

Leave a comment