We had previously written about classification trees and regression trees.
In this blog we will mainly write about preventing over fitting in decision trees.
Handling missing values for decision tree is described here.
Over-fitting
- For decision trees as depth increases training error always go towards zero.
- Occam’s razor philosophy
- Suppose there exists two explanation for an occurrence
- Then less complex explanation is generally correct
- You have headache and stomachache.
- Disease D1 explains headache, D2 explains stomachache, while D3 explains both
- Instead of diagnosing patient with both D1 and D2, saying it D3 would be more accurate in general
- So when you have two trees with same validation error, choose one with lesser complexity
- There are two approaches:
- Early Stopping
- Pruning
Early Stopping
There can be three stopping condition
- Pre define max_depth
- It can be tuned via cross validation, which increases training time
- Does not work well when you have less data
- Stop if decrease in classification error is less than some threshold
- It can become tricky as in case of XOR
- XOR is very famous case in classification literature
- Stop if very few point are left in the node
- This works pretty well in practice
- Stop if node has say 10 to 100 sample
Pruning
- Pruning generally is better solution than early stopping
- Here we build the entire tree first and than remove certain node
- Criterion for removing nodes:
- cost(tree) = error(tree) + λ*L(tree)
- L(tree) = no of leaf nodes
- if cost(deep_tree) > cost(small_tree) then don’t split that node
- Algorithm:
- Start from bottom up the tree and traverse up and test one node each time
- Calculate cost
- Make decision to prune it or not
References:
Course by University of Washington
