Oversampling and Under-sampling

January 21, 2019 | October 25, 2020 | Archit Vora

When data is class-imbalanced, a model tends to predict the majority class. One way to tackle this is to give minority classes more weight in the cost function. Another is over-sampling and under-sampling.
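To make the weighting idea concrete, here is a minimal sketch in plain Python. The weighting rule (total samples divided by number of classes times class count) is an assumption of this example, not something the post prescribes; it is the same heuristic scikit-learn uses for `class_weight="balanced"`. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def weighted_log_loss(labels, probs, weights):
    """Mean binary cross-entropy, each sample scaled by the weight of its true class.

    probs[i] is the predicted probability that sample i belongs to class 1.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p_true = p if y == 1 else 1.0 - p
        total += weights[y] * -math.log(p_true)
    return total / len(labels)

# Toy data (assumed): 90 majority samples (class 0), 10 minority samples (class 1).
labels = [0] * 90 + [1] * 10
# A lazy model that always leans towards the majority class.
probs = [0.1] * 100

weights = balanced_class_weights(labels)
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
print(weights, unweighted, weighted)
```

With these weights, every class contributes the same total weight to the loss, so a model that ignores the minority class is penalised more heavily than under the unweighted loss.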

Over-sampling makes duplicate copies of minority-class samples.
Under-sampling randomly removes some samples from the majority class.
- This should be used with caution, since the removed samples are lost to training.
- We need to check that enough samples remain for the given number of features.
In practice, we might want to over-sample some classes and under-sample others.
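The two operations above can be sketched with the standard library alone. This is an illustrative version under stated assumptions, not a reference implementation: the `resample` helper, its `target_counts` argument, and the three-class toy data are all made up for the example. In real projects a library such as imbalanced-learn is the more usual choice.

```python
import random
from collections import Counter, defaultdict

def resample(samples, labels, target_counts, seed=0):
    """Randomly over- or under-sample each class to the size in target_counts.

    Classes missing from target_counts keep their original size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out_samples, out_labels = [], []
    for cls, items in by_class.items():
        target = target_counts.get(cls, len(items))
        if target <= len(items):
            # Under-sampling: keep a random subset, drawn without replacement.
            chosen = rng.sample(items, target)
        else:
            # Over-sampling: keep every original and add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([cls] * len(chosen))
    return out_samples, out_labels

# Toy data (assumed): sample ids 0..999 with a 900 / 80 / 20 class split.
samples = list(range(1000))
labels = ["a"] * 900 + ["b"] * 80 + ["c"] * 20

# Under-sample the majority class "a", over-sample the rare class "c", leave "b" alone.
new_samples, new_labels = resample(samples, labels, {"a": 300, "c": 100})
print(Counter(new_labels))
```

After under-sampling, it is worth comparing the remaining row count with the number of features before training, as the caution above suggests.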

Cross validation

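The validation set should be taken out of the original data before any resampling [1]. Otherwise, copies of the same minority sample can land in both the training and validation sets, which makes validation scores too optimistic.
- We can do the sampling just before training, on the training data only.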
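A minimal sketch of that ordering: split first, then over-sample only the training indices. The helper names, the 20% validation fraction, and the 950 / 50 toy labels are assumptions made for illustration.

```python
import random
from collections import Counter

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split indices into train and validation, keeping class proportions."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_val = max(1, round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return train_idx, val_idx

def oversample_indices(indices, labels, seed=0):
    """Duplicate minority-class indices until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    largest = max(len(v) for v in by_class.values())
    out = []
    for cls_idx in by_class.values():
        out.extend(cls_idx + rng.choices(cls_idx, k=largest - len(cls_idx)))
    return out

# Toy data (assumed): 950 majority labels and 50 minority labels.
labels = [0] * 950 + [1] * 50

# 1. Hold out the validation set from the original, still imbalanced data.
train_idx, val_idx = stratified_split(labels)
# 2. Resample the training indices only.
train_idx = oversample_indices(train_idx, labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in val_idx))
```

The validation set keeps the original class ratio and shares no sample with the training set, so its scores reflect how the model would behave on unseen, imbalanced data. For k-fold cross validation, the same two steps would be repeated inside each fold.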

References

[0] : https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data

[1] : https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation