Oversampling and Undersampling

When data is class-imbalanced, a model tends to predict the majority class. One way to tackle this is to give minority classes more weight in the cost function. Another is to oversample or undersample the training data.

  • Oversampling makes duplicate copies of minority-class samples.
  • Undersampling randomly removes some samples from the majority class.
    • This should be used with caution, since it discards data.
    • After removal, check that enough samples remain for the given number of features.
  • In practice, we might want to oversample some classes and undersample others.
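
The bullets above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the helper name `resample_to`, its `target_counts` argument, and the toy data are assumptions made for this example. Classes below their target are oversampled by duplicating random samples, classes above it are undersampled by dropping random samples, and classes without a target are left at their original size.

```python
import random
from collections import defaultdict


def resample_to(samples, labels, target_counts, seed=0):
    """Resample each class to target_counts[cls] samples (hypothetical helper).

    Classes not listed in target_counts keep their original size.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out_x, out_y = [], []
    for cls, items in by_class.items():
        target = target_counts.get(cls, len(items))
        if target <= len(items):
            # Undersample: keep a random subset, without replacement.
            chosen = rng.sample(items, target)
        else:
            # Oversample: keep everything, then add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_x.extend(chosen)
        out_y.extend([cls] * len(chosen))
    return out_x, out_y


# Toy data: 80 samples of class "a", 15 of "b", 5 of "c".
samples = list(range(100))
labels = ["a"] * 80 + ["b"] * 15 + ["c"] * 5

# Undersample "a" to 40, oversample "c" to 20, leave "b" alone.
new_x, new_y = resample_to(samples, labels, {"a": 40, "c": 20})
```

After resampling, the class sizes can be compared against the number of features to confirm that undersampling has not left too few samples.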

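The other option mentioned at the start, weighting the cost function, needs a weight per class. One common heuristic, assumed here for illustration, is to make each weight inversely proportional to the class frequency: n_samples / (n_classes * class_count). The function name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: 90 majority samples (class 0) and 10 minority samples (class 1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
# Each sample's loss term would be multiplied by weights[its label], so a
# mistake on a minority sample costs more than one on a majority sample.
```

With these weights every class contributes the same total weight to the cost, regardless of how many samples it has.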
 

Cross validation

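  • The validation set should be held out from the original data before any resampling [1]. Otherwise, copies of the same minority sample can end up in both the training and validation sets and inflate the validation score.
    • Resampling can then be applied just before training, to the training data only.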
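
A minimal sketch of that order of operations, using only the standard library: split the original indices into folds first, then oversample inside each training split. The helper names `k_fold_indices` and `oversample_minority`, the fold count, and the toy labels are assumptions for illustration. The folds here are not stratified, so a real pipeline might prefer a stratified split.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oversample_minority(indices, labels, rng):
    """Duplicate random samples until every class matches the largest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    largest = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=largest - len(members)))
    return balanced


# Toy labels: 90 majority samples (class 0) and 10 minority samples (class 1).
labels = [0] * 90 + [1] * 10
folds = k_fold_indices(len(labels), k=5)
rng = random.Random(1)

splits = []
for f, val_idx in enumerate(folds):
    # The validation fold is taken from the original data and never resampled.
    train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
    # Oversampling touches the training indices only.
    train_idx = oversample_minority(train_idx, labels, rng)
    splits.append((train_idx, val_idx))
```

Because duplicates are drawn only from the training indices, no validation sample can have a copy in the training split.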

 

References

[0] : https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data

[1] : https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation