Handling missing values in Decision Trees

We had previously written about classification trees and regression trees.

In this post we look at how to handle missing values, specifically for decision trees.

Handling over-fitting for decision trees is described here.

Missing Values

Missing values can cause issues at:

  1. Training time
  2. Prediction time
    1. What if a sample has no value for some feature at prediction time?

Purification by Skipping

  • This approach involves skipping rows or entire features that contain missing values.
  • However, this method comes with a drawback as it results in the loss of valuable information.
  • It is important to note that skipping missing values during training does not resolve the issue at prediction time, when a sample with a missing feature still needs a prediction.
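To make the trade-off concrete, here is a minimal sketch of both kinds of skipping on a toy dataset. The column names, the values, and the helper names `skip_rows` and `skip_features` are made up for illustration, and `None` stands in for a missing value.

```python
# Toy dataset: None marks a missing value (columns and values are invented).
rows = [
    {"age": 25, "income": 50000, "state": "WA"},
    {"age": None, "income": 62000, "state": "CA"},
    {"age": 41, "income": None, "state": "WA"},
    {"age": 33, "income": 48000, "state": "NY"},
]


def skip_rows(rows):
    """Keep only the rows in which every feature has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def skip_features(rows):
    """Keep only the features that have a value in every row."""
    complete = [f for f in rows[0] if all(r[f] is not None for r in rows)]
    return [{f: r[f] for f in complete} for r in rows]


complete_rows = skip_rows(rows)
complete_features = skip_features(rows)

print(len(complete_rows), "of", len(rows), "rows survive")
print("features that survive:", list(complete_features[0]))
```

On this toy data, two missing cells are enough to discard half of the rows or two of the three features, which is the information loss mentioned above.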

Imputation

  • Imputation refers to the process of filling in missing values with estimated values.
  • It provides a means to address missing values during both training and prediction.
  • A dictionary can be created to store imputed values for each feature, which can then be utilized during prediction.
  • For categorical features, a common imputation method is to use the mode, while for numeric features, the average or median is often employed.
  • Advanced techniques, such as the expectation maximization algorithm, can also be utilized for imputation.
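The dictionary idea can be sketched roughly as follows, using only the Python standard library. The names `fit_imputer` and `impute` and the toy data are our own assumptions, not part of any library. The sketch uses the mode for categorical features and the median for numeric ones, and it assumes every feature has at least one observed value in the training data.

```python
from statistics import median, mode


def fit_imputer(rows, categorical):
    """Build a {feature: fill value} dictionary from the training rows."""
    fill = {}
    for feature in rows[0]:
        observed = [r[feature] for r in rows if r[feature] is not None]
        # Mode for categorical features, median for numeric ones.
        fill[feature] = mode(observed) if feature in categorical else median(observed)
    return fill


def impute(row, fill):
    """Return a copy of the row with missing values replaced by the stored ones."""
    return {f: (fill[f] if v is None else v) for f, v in row.items()}


train = [
    {"age": 25, "state": "WA"},
    {"age": None, "state": "CA"},
    {"age": 41, "state": "WA"},
    {"age": 33, "state": None},
]

# Learn the fill values once, on the training data only.
fill_values = fit_imputer(train, categorical={"state"})
train_filled = [impute(r, fill_values) for r in train]

# Reuse the same dictionary at prediction time.
new_sample = impute({"age": None, "state": "NY"}, fill_values)
print(fill_values)
print(new_sample)
```

Storing the fill values, rather than recomputing them on each new batch, keeps training and prediction consistent.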

However, it’s essential to be aware that imputation can introduce systematic biases into the data. For example, consider a scenario where people from Washington tend not to provide their age, while people in the rest of the United States do. Every Washington resident would then receive an age estimated from other regions, which may not reflect the actual ages in Washington.

Algorithm-specific technique

In decision tree algorithms, there are specific techniques to handle missing values. At each node in the decision tree, we need to determine which branch a sample with a missing value will take. The same feature can be split on at several nodes, and its missing values may be sent down a different branch at each of them. To address this, we can make a tweak in the decision tree algorithm:

  1. During feature selection, consider assigning the missing value to each possible branch.
  2. Evaluate the effect of assigning the missing value to each branch by measuring the reduction in classification error.
  3. Choose the feature and branch that result in the highest reduction in classification error when the missing value is assigned.
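The three steps above can be sketched for a single node as follows. This is a simplified illustration and not a full tree learner: it assumes binary features encoded as 0/1 with `None` for missing values, and the names `mistakes` and `best_split` and the toy data are hypothetical. Since the error before the split is the same for every candidate, picking the lowest error after the split is equivalent to picking the highest reduction in error.

```python
from collections import Counter


def mistakes(labels):
    """Number of points misclassified when predicting the majority label."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_split(rows, labels, features):
    """Try every (feature, branch for missing values) pair and return
    (error, feature, missing_branch) with the lowest classification error."""
    best = None
    for feature in features:
        for missing_branch in (0, 1):
            left, right = [], []
            for row, label in zip(rows, labels):
                value = row[feature]
                # Missing values follow the branch currently being tried.
                branch = missing_branch if value is None else value
                (left if branch == 0 else right).append(label)
            error = (mistakes(left) + mistakes(right)) / len(labels)
            if best is None or error < best[0]:
                best = (error, feature, missing_branch)
    return best


rows = [
    {"credit_ok": 1, "long_term": 0},
    {"credit_ok": 1, "long_term": 1},
    {"credit_ok": 0, "long_term": 1},
    {"credit_ok": 0, "long_term": 0},
    {"credit_ok": None, "long_term": 1},
    {"credit_ok": None, "long_term": 0},
]
labels = ["safe", "safe", "risky", "risky", "safe", "safe"]

error, feature, missing_branch = best_split(rows, labels, ["credit_ok", "long_term"])
print(feature, missing_branch, error)
```

In this toy data the samples with a missing `credit_ok` value are all labelled safe, so sending them down the same branch as `credit_ok == 1` gives the lowest error.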

However, what if, during prediction, a missing value reaches a node where we saw no missing values during training? In such cases, we can store the direction that received the most training samples at that node as the default direction for missing values. Let’s call this direction “max_sample.”

If we do encounter missing values at a node while training the decision tree, we can override the default “max_sample” direction with the direction learned for the missing values at that node.
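Here is a rough sketch of how the default direction could be stored on each node and used at prediction time. The `Node` class, the `default_branch` helper, and the sample counts are assumptions made for illustration, again with binary 0/1 features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[str] = None     # None marks a leaf
    prediction: Optional[str] = None  # only used at leaves
    left: Optional["Node"] = None     # branch taken when the feature value is 0
    right: Optional["Node"] = None    # branch taken when the feature value is 1
    missing_branch: int = 0           # branch taken when the value is missing


def default_branch(n_left, n_right, learned_branch=None):
    """Use the branch learned from missing training values when there is one,
    otherwise fall back to the branch with more training samples ("max_sample")."""
    if learned_branch is not None:
        return learned_branch
    return 0 if n_left >= n_right else 1


def predict(node, sample):
    """Walk down the tree, following missing_branch whenever a value is missing."""
    while node.feature is not None:
        value = sample.get(node.feature)
        branch = node.missing_branch if value is None else value
        node = node.left if branch == 0 else node.right
    return node.prediction


# No missing values were seen at this node during training, so the default is
# the side that received more training samples (4 on the right vs. 2 on the left).
tree = Node(
    feature="credit_ok",
    left=Node(prediction="risky"),
    right=Node(prediction="safe"),
    missing_branch=default_branch(n_left=2, n_right=4),
)

print(predict(tree, {"credit_ok": None}))
print(predict(tree, {"credit_ok": 0}))
```

Because every internal node carries a `missing_branch`, prediction never gets stuck on a missing feature, whether or not that node saw missing values during training.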

By incorporating these techniques, we can effectively handle missing values in decision trees and make informed decisions based on the available data.

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification
