Classification Trees

How classification Tree creates Rectangle in Predictor Space

Criterion for Choosing the Splitting Predictor

When deciding which predictor to split on, we need to consider different types of predictors: continuous, binary, and categorical. In this post, we will focus on continuous and binary predictors.

For categorical predictors, one approach is to use the “one vs rest” strategy. We evaluate each category separately and observe which one reduces the randomness the most when split.

In the case of a continuous predictor, we first sort the values and then select the midpoint as the split point.

To keep our decision tree simple, it’s ideal to have a small tree size. Therefore, at each step, we should choose the split that leads to the purest child nodes.

Two commonly used criteria for measuring impurity are Gini and Entropy. It’s important to note that these criteria are not directly used to select the predictor to split on. Instead, we calculate the difference in impurity before and after splitting.

When computing Gini or entropy after splitting, we apply weights to both splits to account for their proportions.

Multi-class

Decision tree can naturally handle multi class problem as entropy and gini can be calculated for multi-class.

Entropy

Formula:
- H = ∑ -p * log(p)
About :
- Randomness
- Information Carries
- Highest when p=0.5
- Higher Entropy => Higher Randomness => Carries more information
- After splitting we want entropy to reduce

Initial :
- n0 positive and m0 negative sample
- g0 = (n+m) total samples
After Splitting
- Group 1:
  - n1 positive
  - m1 negative
  - g1 = (n1 + m1) total
- Group 2:
  - n2 positive
  - m2 negative
  - g2 = (n2 + m2) total

Before Entropy :
- H_before = -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
After Entropy :
- H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
- H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
- H_after = (g1/g0) * H1 + (g2/g0)*H2
diff = H_before – H_after
Select a predictor for which diff is highest and split on it
- Which means we are selecting a predictor which reduces randomness more
- Which also mean we are selecting a predictor which reduces information more
  - So we are selecting a variable which carries more information
    - More important feature

Gini

Gini for 2 class :
- G = p1 * (1-p1) + p2 * (1 – p2)
- Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
Gini in general:
- G = ∑ p * (1 – p) = 1 – ∑ p²
Like entropy gini is maximum when p1 = p2 = 0.5
- In this case nodes are least pure
Gini is a measure of impurity which we want to reduce by splitting on predictor
diff = G_before – G_after
- We will select a predictor for which diff is highest

Entropy vs Gini

Given a choice gini will be better as computing log is costlier.
Both of them are better than “classification error as metric” [1]
- Which would be selecting majority class for calculating classification error
- It is useful while pruning but not while growing the tree

ROC Curve

ROC curve can be constructed easily for classifier which outputs ranking.
- For example logistic recession outputs probability where we can change the threshold.
On the contrary decision tree outputs label
However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
Then we can change the threshold and plot ROC.

Regularization

Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

limit max. depth of trees
Cost complexity pruning
ensembles / bag more than just 1 tree
set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)

Reference

[0] : Applied predictive modeling by Max Kuhn and Kjell Johnson

[1] : https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

Data Stories

Classification Trees

Criterion for Choosing the Splitting Predictor

Multi-class

Entropy

Gini

Entropy vs Gini

ROC Curve

Regularization

Reference

3 thoughts on “Classification Trees”

Leave a comment Cancel reply

Criterion for Choosing the Splitting Predictor

Multi-class

Entropy

Gini

Entropy vs Gini

ROC Curve

Regularization

Reference

Share this:

Related

3 thoughts on “Classification Trees”

Leave a comment Cancel reply