Skip to content

Handling unbalanced data¶

What is unbalanced data?¶

Large discreptancy between "positive" and "negatrive" cases
- i.e., fraud detection. Fraid is rate and most rows will be non-fraud.
Mainly a problem with neural networks.

Oversampling¶

Duplicate samples from the minority class
Can be done at random

Undersampling¶

Instead of creating more positive samples, remove the negative ones
Throwing data away is usually not the right answer
- Unless you are specifically trying to avoid "big data" scaling issues.

SMOTE¶

Synthetic minority over-sampling technique
Artificially generate new samples of the minority class using nearest neighbors
- Run KNN of each sample of the minority class
- Create a new sample from the KNN result (mean of the neighbors)
Both generates new samples and undersamples majority class
Generally better than just oversampling

Adjusting thresholds¶

When making predictions about a classification (fraud / not fraud), you have some sort of threshold of probability at which point you'll flag something as the positive case (fraud).
If you have too many false positives, one way to fix that is to simply increase that threshold.
- Guaranteed to reduce false positives
- But, could result in more false negatives