6.15 Imbalanced classes

Classification with imbalanced classes is the rule, not the exception, in real applications. Fraud is rare. Disease is rare. Defects are rare. The prevalence of the positive class can be 1 in 100, 1 in 10,000, or 1 in 10⁶.

Naive training optimises average loss, and on imbalanced data high accuracy is achieved simply by predicting the majority class: a classifier that always predicts "no fraud" is 99% accurate when fraud occurs in 1% of transactions. This apparent success is easily mistaken for skill. The remedies fall into three families.

Resampling

  • Random oversampling. Duplicate minority examples until classes are balanced. Simple, but causes the model to memorise duplicated examples.
  • Random undersampling. Drop majority examples. Cheap, but throws away information.
  • SMOTE (Chawla et al., 2002). Synthetic Minority Oversampling Technique. For each minority example, draw a line to one of its $k$ nearest minority neighbours and pick a random point on it. Generates synthetic minority examples without exact duplicates.
  • SMOTE-Tomek, ADASYN, Borderline-SMOTE. Variants that focus on the boundary between classes.
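The interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. This is a minimal illustration, not a substitute for a library implementation such as `imblearn`'s; the function name and signature are my own:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    each sampled point toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest per point
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                              # pick a minority point
        b = neighbours[a, rng.integers(min(k, n - 1))]   # one of its neighbours
        lam = rng.random()                               # random point on the segment
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic
```

Because every synthetic point is a convex combination of two real minority points, the new examples lie inside the minority region rather than being exact copies of it.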

Loss reweighting

Multiply the loss for minority examples by a factor that compensates for their rarity. The simplest weighting is inverse class frequency: $w_c = N / (C \cdot n_c)$, where $N$ is the number of training examples, $C$ the number of classes, and $n_c$ the count of class $c$. Focal loss (Lin et al., 2017) reweights examples rather than classes: $$ \text{FL}(\hat p, y) = -(1 - \hat p_y)^\gamma \log \hat p_y. $$ The factor $(1 - \hat p_y)^\gamma$ down-weights well-classified examples and forces the model to focus on hard ones. Originally proposed for object detection (where most anchors are background and easily classified), it has become standard wherever extreme imbalance meets a deep model.
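Both weightings are easy to express directly. A small NumPy sketch (function names are illustrative; in a training loop these would feed into the framework's loss):

```python
import numpy as np

def class_weights(y, n_classes):
    """Inverse-frequency weights: w_c = N / (C * n_c)."""
    counts = np.bincount(y, minlength=n_classes)
    return len(y) / (n_classes * counts)

def focal_loss(p_y, gamma=2.0):
    """Focal loss per example, given p_y = predicted probability
    of the true class. gamma = 0 recovers plain cross-entropy."""
    p_y = np.clip(p_y, 1e-12, 1.0)           # guard log(0)
    return -((1.0 - p_y) ** gamma) * np.log(p_y)
```

Note the two mechanisms compose: class weights rescale by rarity, while the focal factor rescales by difficulty, so a confidently correct prediction contributes almost nothing regardless of its class.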

Threshold tuning

A probabilistic classifier produces a probability; the binary decision depends on the threshold. The default 0.5 is rarely optimal. For a given operating point (a target precision, a target recall, or a fixed false-positive budget), the right threshold is determined post-hoc on a held-out set. Evaluate on the precision–recall or ROC curve; pick the operating point that meets the business constraint; deploy. This is far cheaper than retraining and often as effective as resampling.

A caveat

Resampling and reweighting both distort the training distribution. A model trained on rebalanced data will be miscalibrated on the original data: its probabilities no longer reflect base rates. If the downstream system uses the probability itself (for risk stratification, expected-utility decisions, or Bayesian updating), you must re-calibrate post-hoc. Platt scaling or isotonic regression on a held-out (un-rebalanced) calibration set is the standard fix.
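Platt scaling fits a two-parameter sigmoid $\sigma(a s + b)$ to the held-out labels. A minimal gradient-descent sketch of that fit; in practice one would reach for `sklearn.calibration.CalibratedClassifierCV` rather than hand-rolling it:

```python
import numpy as np

def platt_fit(scores, y, iters=10000, lr=0.5):
    """Fit sigmoid(a * s + b) to held-out (un-rebalanced) labels by
    gradient descent on the log loss."""
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        g = p - y                        # d(log loss)/d(logit), per example
        a -= lr * np.mean(g * scores)
        b -= lr * np.mean(g)
    return a, b

def platt_apply(scores, a, b):
    """Map raw scores to calibrated probabilities."""
    return 1.0 / (1.0 + np.exp(-(a * scores + b)))
```

At the optimum the mean residual is zero, so the average calibrated probability matches the base rate of the calibration set: exactly the property the rebalanced model lost.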
