overfitting | Data Stories

We had previously written about classification trees and regression trees.

In this blog we will mainly write about preventing over fitting in decision trees.

Handling missing values for decision tree is described here.

For decision trees as depth increases training error always go towards zero.
Occam’s razor philosophy
- Suppose there exists two explanation for an occurrence
- Then less complex explanation is generally correct
- You have headache and stomachache.
- Disease D1 explains headache, D2 explains stomachache, while D3 explains both
- Instead of diagnosing patient with both D1 and D2, saying it D3 would be more accurate in general
So when you have two trees with same validation error, choose one with lesser complexity
There are two approaches:
- Early Stopping
- Pruning

There can be three stopping condition

Pre define max_depth
1. It can be tuned via cross validation, which increases training time
2. Does not work well when you have less data
Stop if decrease in classification error is less than some threshold
1. It can become tricky as in case of XOR
2. XOR is very famous case in classification literature
Stop if very few point are left in the node
1. This works pretty well in practice
2. Stop if node has say 10 to 100 sample

Pruning generally is better solution than early stopping
Here we build the entire tree first and than remove certain node
Criterion for removing nodes:
- cost(tree) = error(tree) + λ*L(tree)
- L(tree) = no of leaf nodes
if cost(deep_tree) > cost(small_tree) then don’t split that node
Algorithm:
- Start from bottom up the tree and traverse up and test one node each time
- Calculate cost
- Make decision to prune it or not

Course by University of Washington

Data Stories