Overfitting in Decision Trees

October 3, 2018May 21, 2023Archit Vora 2 Comments

We had previously written about classification trees and regression trees.

In this blog we will mainly write about preventing over fitting in decision trees.

Handling missing values for decision tree is described here.

Over-fitting

For decision trees as depth increases training error always go towards zero.
Occam’s razor philosophy
- Suppose there exists two explanation for an occurrence
- Then less complex explanation is generally correct
- You have headache and stomachache.
- Disease D1 explains headache, D2 explains stomachache, while D3 explains both
- Instead of diagnosing patient with both D1 and D2, saying it D3 would be more accurate in general
So when you have two trees with same validation error, choose one with lesser complexity
There are two approaches:
- Early Stopping
- Pruning

Early Stopping

There can be three stopping condition

Pre define max_depth
1. It can be tuned via cross validation, which increases training time
2. Does not work well when you have less data
Stop if decrease in classification error is less than some threshold
1. It can become tricky as in case of XOR
2. XOR is very famous case in classification literature
Stop if very few point are left in the node
1. This works pretty well in practice
2. Stop if node has say 10 to 100 sample

Pruning

Pruning generally is better solution than early stopping
Here we build the entire tree first and than remove certain node
Criterion for removing nodes:
- cost(tree) = error(tree) + λ*L(tree)
- L(tree) = no of leaf nodes
if cost(deep_tree) > cost(small_tree) then don’t split that node
Algorithm:
- Start from bottom up the tree and traverse up and test one node each time
- Calculate cost
- Make decision to prune it or not

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

Handling missing values in Decision Tree

October 3, 2018May 21, 2023Archit Vora 1 Comment

We had previously written about classification trees and regression trees.

In this blog we will mainly write about handling missing values specifically for decision trees.

Handling over-fitting for decision tree is described here.

Missing Values

Missing value can cause issues at :

Training time
Prediction time
1. What if while prediction sample does not have value for some feature

Purification by Skipping

This approach involves skipping rows or entire features that contain missing values.
However, this method comes with a drawback as it results in the loss of valuable information.
It is important to note that skipping missing values during training will not resolve the issue during prediction time

Imputation

Imputation refers to the process of filling in missing values with estimated or imputed values.
It provides a means to address missing values during both training and prediction.
A dictionary can be created to store imputed values for each feature, which can then be utilized during prediction.
For categorical features, a common imputation method is to use the mode, while for numeric features, the average or median is often employed.
Advanced techniques, such as the expectation maximization algorithm, can also be utilized for imputation.

However, it’s essential to be aware that imputation can introduce systematic biases into the data. For example, let’s consider a scenario where people from Washington tend to not provide their age, while the rest of the United States does. This could lead to a systematic bias in the imputed values, as the imputation process may assign ages based on data from other regions.

Algo specific technique

In decision tree algorithms, there are specific techniques to handle missing values. At each node in the decision tree, we need to determine which branch the missing value will take. It’s possible for the same feature to have different values at different nodes. To address this, we can make a tweak in the decision tree algorithm:

During feature selection, consider assigning the missing value to each possible branch.
Evaluate the effect of assigning the missing value to each branch by measuring the reduction in classification error.
Choose the feature and branch that result in the highest reduction in classification error when the missing value is assigned.

However, what if we encounter a node during prediction where we didn’t have any missing values during training? In such cases, we can store the direction with the most number of samples at that node as the default direction for missing values. Let’s call this direction “max_sample.”

If we encounter missing values while training the decision tree, we can override the default “max_sample” direction with the observed direction for the missing values.

By incorporating these techniques, we can effectively handle missing values in decision trees and make informed decisions based on the available data.

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

AdaBoost

July 23, 2018October 25, 2020Archit Vora 2 Comments

AdaBoost stands for adaptive boosting
Boosting is a technique for creating better classifier form several regular/weak/base classifiers
Thought:
- We want to train second classifier in such a way that training data has more samples for the case where first classifier was wrong
- Third classifier will have more samples where first and second classifier contradicts
How do we accomplish above:
- We assign weights to each sample
  - This weights will be used in error function for each sample
  - Error function would be sum of weights of incorrectly classified samples
  - This weights keep changing in each iteration
- We also assign wight to each base (weak) classifier (α In below algorithm)
  - Formula for calculating α below is valid for binary classification
    - It will be different for multi class classification
When do we stop iteration ?
- Combined classifier is predicting all sample correctly
- No more base classifier left
Most machine learning algorithms allows to apply weight to data
- In logistic regression we can multiply weight in error fucntion
- In decision tree, classification error can be calculated by summing weights of missclassified samples

adboost

Formulations in above algorithm:

Formula of α
- if ε=0.5 then α=0. It is just a random chance classifier.
- Else it can positive/negative.
- ε < 0.5 => negative α => we invert the prediction in weighted average
Formula of D_(t+1)(i)
- y*h is just a sign
  - y and h are {-1, 1}
  - y*h will be positive for correct prediction
- Why it is function of ε
  - Random classifier updating weights does not make sense
By Z_t we are normalizing weights, which ensures numerical stability

Adaboost Theorem

Error -> 0 as no of iteration -> infinity at certain condition
Condition is here is that each learner has error < 0.5
Each iteration means adding one weak learner

Over-fitting

Contrary to weak classifier (decision stumps) we saw that rate of increase test error is less with boosting
However no of iteration still should be selected via cross validation for small data-set and validation for large data-set

References:

Click to access MIT6_034F10_boosting.pdf

Click to access msri.pdf

Classification Trees

May 1, 2018May 21, 2023Archit Vora 3 Comments

How classification Tree creates Rectangle in Predictor Space

Criterion for Choosing the Splitting Predictor

When deciding which predictor to split on, we need to consider different types of predictors: continuous, binary, and categorical. In this post, we will focus on continuous and binary predictors.

For categorical predictors, one approach is to use the “one vs rest” strategy. We evaluate each category separately and observe which one reduces the randomness the most when split.

In the case of a continuous predictor, we first sort the values and then select the midpoint as the split point.

To keep our decision tree simple, it’s ideal to have a small tree size. Therefore, at each step, we should choose the split that leads to the purest child nodes.

Two commonly used criteria for measuring impurity are Gini and Entropy. It’s important to note that these criteria are not directly used to select the predictor to split on. Instead, we calculate the difference in impurity before and after splitting.

When computing Gini or entropy after splitting, we apply weights to both splits to account for their proportions.

Multi-class

Decision tree can naturally handle multi class problem as entropy and gini can be calculated for multi-class.

Entropy

Formula:
- H = ∑ -p * log(p)
About :
- Randomness
- Information Carries
- Highest when p=0.5
- Higher Entropy => Higher Randomness => Carries more information
- After splitting we want entropy to reduce

Initial :
- n0 positive and m0 negative sample
- g0 = (n+m) total samples
After Splitting
- Group 1:
  - n1 positive
  - m1 negative
  - g1 = (n1 + m1) total
- Group 2:
  - n2 positive
  - m2 negative
  - g2 = (n2 + m2) total

Before Entropy :
- H_before = -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
After Entropy :
- H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
- H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
- H_after = (g1/g0) * H1 + (g2/g0)*H2
diff = H_before – H_after
Select a predictor for which diff is highest and split on it
- Which means we are selecting a predictor which reduces randomness more
- Which also mean we are selecting a predictor which reduces information more
  - So we are selecting a variable which carries more information
    - More important feature

Gini

Gini for 2 class :
- G = p1 * (1-p1) + p2 * (1 – p2)
- Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
Gini in general:
- G = ∑ p * (1 – p) = 1 – ∑ p²
Like entropy gini is maximum when p1 = p2 = 0.5
- In this case nodes are least pure
Gini is a measure of impurity which we want to reduce by splitting on predictor
diff = G_before – G_after
- We will select a predictor for which diff is highest

Entropy vs Gini

Given a choice gini will be better as computing log is costlier.
Both of them are better than “classification error as metric” [1]
- Which would be selecting majority class for calculating classification error
- It is useful while pruning but not while growing the tree

ROC Curve

ROC curve can be constructed easily for classifier which outputs ranking.
- For example logistic recession outputs probability where we can change the threshold.
On the contrary decision tree outputs label
However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
Then we can change the threshold and plot ROC.

Regularization

Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

limit max. depth of trees
Cost complexity pruning
ensembles / bag more than just 1 tree
set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)

Reference

[0] : Applied predictive modeling by Max Kuhn and Kjell Johnson

[1] : https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

Data Stories

Tag decision tree

Overfitting in Decision Trees

Over-fitting

Early Stopping

Pruning

References:

Handling missing values in Decision Tree

Missing Values

Purification by Skipping

Imputation

Algo specific technique

References:

AdaBoost

Classification Trees

Criterion for Choosing the Splitting Predictor

Multi-class

Entropy

Gini

Entropy vs Gini

ROC Curve

Regularization

Reference