When data is class-imbalanced, a model tends to predict the majority class. One way to tackle this is to give minority classes more weight in the cost function. Another way is over-sampling and under-sampling.
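The weighting idea can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes inverse-frequency weights (`n_samples / (n_classes * class_count)`) and a weighted log loss, and the helper names `balanced_class_weights` and `weighted_log_loss` are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Weighted mean negative log-likelihood.

    probs[i] is the predicted probability of the true class of sample i.
    """
    total = sum(weights[y] * -math.log(p) for y, p in zip(labels, probs))
    return total / sum(weights[y] for y in labels)


# Toy data (assumed for illustration): 90 majority and 10 minority samples.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# A model that is confident on the majority class but poor on the minority.
probs = [0.9] * 90 + [0.2] * 10
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
```

With these weights each class contributes equally to the total, so the poor minority predictions raise the weighted loss well above the unweighted one.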
- Over-sampling makes duplicate copies of minority-class samples.
- Under-sampling randomly removes some samples from the majority class.
- Under-sampling should be used with caution, since it discards data.
- After under-sampling, we need to check that enough samples remain for the given number of features.
- In practice, we might want to over-sample some classes and under-sample others.
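The two sampling strategies can be combined in one helper. The sketch below is an assumption-laden illustration using only the standard library: `resample` and `target_counts` are hypothetical names, and the per-class target sizes are chosen by the caller.

```python
import random
from collections import Counter, defaultdict


def resample(samples, labels, target_counts, seed=0):
    """Randomly resample each class to the size given in target_counts.

    Classes smaller than their target are over-sampled by duplicating random
    members; larger classes are under-sampled by dropping random members.
    Classes missing from target_counts are left unchanged.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out = []
    for y, xs in by_class.items():
        target = target_counts.get(y, len(xs))
        if target <= len(xs):
            chosen = rng.sample(xs, target)  # under-sample, no replacement
        else:
            chosen = xs + rng.choices(xs, k=target - len(xs))  # add duplicates
        out.extend((x, y) for x in chosen)

    rng.shuffle(out)
    return [x for x, _ in out], [y for _, y in out]


# Toy data (assumed): three classes with 80, 15 and 5 samples.
samples = list(range(100))
labels = [0] * 80 + [1] * 15 + [2] * 5

# Under-sample class 0 to 40, over-sample class 2 to 20, keep class 1 as is.
new_samples, new_labels = resample(samples, labels, {0: 40, 2: 20})
class_sizes = Counter(new_labels)
```

After resampling, `len(new_samples)` can be compared against the number of features to confirm that enough data remains.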
Cross-validation
- The validation set should be held out from the original data, before any resampling [0].
- Resampling is then applied only to the training data, just before training.
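A minimal sketch of this ordering is shown below: split first, then resample only the training part of each fold. It assumes plain (non-stratified) k-fold splitting and random over-sampling to the largest class; `oversample_minority` and `k_fold_with_resampling` are hypothetical helper names.

```python
import random
from collections import Counter


def oversample_minority(rows, rng):
    """Duplicate random rows of smaller classes until all match the largest."""
    by_class = {}
    for x, y in rows:
        by_class.setdefault(y, []).append((x, y))
    largest = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=largest - len(members)))
    return balanced


def k_fold_with_resampling(rows, k=5, seed=0):
    """Yield (train, validation) pairs; only the training part is resampled."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]  # untouched original samples
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield oversample_minority(train, rng), validation


# Toy data (assumed): (sample_id, label) rows, 90 majority and 10 minority.
rows = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]
splits = list(k_fold_with_resampling(rows, k=5))
```

Because duplicates are created after the split, no copy of a validation sample can end up in the training set, and the validation folds keep the original class distribution.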
Reference
[0] : https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data
