6.10 Pipelines and leakage

The single most common cause of overestimated performance in deployed ML is leakage: information from outside the training set sneaks into the model and inflates apparent performance. Leakage is insidious because the model passes every internal validation check yet fails on first contact with reality.

Preprocessing leakage

The textbook example. You want to standardise your features. You compute the mean and standard deviation on the entire dataset, including the test rows, and only then split into train and test. The resulting test-set scores are too good, because the model has implicitly seen the test-set summary statistics. The fix is to compute scaling statistics only on the training fold and apply them to the validation/test fold. Any preprocessing that depends on the data (scaling, imputation, target encoding, PCA, feature selection) must live inside the cross-validation loop. scikit-learn's Pipeline is the right abstraction.
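
A minimal sketch of the pattern, on a synthetic dataset (the estimator and data here are purely illustrative). Because the scaler sits inside the Pipeline, it is refitted from scratch on the training rows of every fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# The scaler is fitted inside each fold on training rows alone, so the
# validation rows never contribute to the mean/std estimates.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())
```

Calling `StandardScaler().fit(X)` on the full matrix before `cross_val_score` would reintroduce exactly the leak described above.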

Target leakage

A feature accidentally contains information about the target that would not be available at prediction time.

  • A model predicting whether a patient develops pneumonia includes "received antibiotics" as a feature. Antibiotics are prescribed after diagnosis, not before. The feature is a perfect predictor in training, but at inference time you do not yet know whether antibiotics will be prescribed.
  • A churn model includes "days since last login," but the test data was filtered to exclude users who had not logged in for 30 days. The 30-day threshold is simply the label in disguise.
  • A credit-risk model includes "principal balance at month 12" to predict default by month 12. Default has already happened by month 12.

Target leakage is hard to spot because the offending feature genuinely correlates with the label; what is wrong is the temporal direction. The defence is to list every feature explicitly and ask of each: "could I have observed this before prediction time?".
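
That manual audit can be complemented (never replaced) by a crude automated screen: score each feature on its own against a binary label and flag anything that discriminates almost perfectly, as "received antibiotics" would. The function name and threshold below are illustrative, not from the text:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def flag_suspicious_features(X, y, feature_names, threshold=0.95):
    """Flag features whose single-column AUC is implausibly high.

    A heuristic, not a proof: near-perfect single-feature
    discrimination often means the feature was generated after
    the label, i.e. target leakage.
    """
    suspects = []
    for j, name in enumerate(feature_names):
        auc = roc_auc_score(y, np.asarray(X, dtype=float)[:, j])
        auc = max(auc, 1.0 - auc)  # direction of correlation is irrelevant
        if auc >= threshold:
            suspects.append((name, round(auc, 3)))
    return suspects
```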

Train–test contamination

The same row, or a near-duplicate, appears in both train and test. Common in scraped datasets. Image deduplication, text near-duplicate detection (MinHash, LSH), and joining on unique IDs before splitting are the standard defences. The consequences of contamination at the benchmark level (the LLM benchmark crisis) are serious enough that we discuss them separately in §6.16.
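
A sketch of the simplest defence, exact-hash deduplication before splitting. For near-duplicates, the SHA-256 digest would be replaced by a MinHash signature queried through an LSH index (the datasketch library provides both):

```python
import hashlib

def dedup_exact(texts):
    """Keep the index of the first occurrence of each text, so no
    duplicate can later straddle a train/test split.

    Catches byte-identical rows only; near-duplicates need
    MinHash/LSH rather than an exact digest.
    """
    seen, keep = set(), []
    for i, t in enumerate(texts):
        h = hashlib.sha256(t.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            keep.append(i)
    return keep
```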

Group leakage

Multiple readings from the same patient, or multiple comments from the same user, end up split between train and test. Even though the rows are different, the underlying entities are the same; the model effectively memorises the entity. The fix is GroupKFold with the entity identifier as the group key.
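
In scikit-learn terms, with a hypothetical patient_id array as the group key (the data here is a synthetic stand-in):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in: roughly five readings per patient.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)
patient_id = rng.integers(0, 60, size=300)

# GroupKFold keeps all rows from a patient in the same fold, so every
# validation patient is one the model has never seen.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=GroupKFold(n_splits=5), groups=patient_id)
```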

Temporal leakage

Random splitting in time-series data lets the model train on the future and test on the past. The fix is forward-chaining cross-validation: train on $[1, t]$, test on $(t, t+w]$, then advance.
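
scikit-learn's TimeSeriesSplit implements this forward-chaining scheme. A toy illustration, assuming the rows are already sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # rows assumed sorted by time

# Each split trains on an initial segment and tests on the window that
# follows it, matching the [1, t] / (t, t+w] scheme above.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print(f"train rows 0–{train_idx[-1]}, "
          f"test rows {test_idx[0]}–{test_idx[-1]}")
```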

A defensive checklist

Before reporting a number:

  1. List every feature. For each, ask: when was it generated relative to the prediction time?
  2. Identify the unit of analysis (patient, customer, document). Is the split done at that level?
  3. Are all preprocessing steps fitted only on the training fold?
  4. Are there near-duplicates? Did you check the hash distribution?
  5. Are train and test from the same time window, or is there a temporal split?
  6. Could the labels themselves be derivable from any combination of features?

A model that fails any of these checks should be suspected of leakage until proven innocent.
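
Items 2 and 4 of the checklist are cheap to mechanise. A sketch, assuming hypothetical pandas DataFrames train_df and test_df that share an entity-identifier column:

```python
import hashlib
import pandas as pd

def check_split(train_df: pd.DataFrame, test_df: pd.DataFrame,
                entity_col: str = "entity_id") -> None:
    """Assert that items 2 and 4 of the checklist hold for a split."""
    # Item 2: the split must be done at the entity level, so no entity
    # identifier may appear on both sides.
    shared = set(train_df[entity_col]) & set(test_df[entity_col])
    assert not shared, f"{len(shared)} entities straddle the split"

    # Item 4: no row (ignoring the identifier) may appear verbatim on
    # both sides. Exact duplicates only; near-duplicates need MinHash.
    def row_hashes(df):
        rows = df.drop(columns=[entity_col]).itertuples(index=False)
        return {hashlib.sha256(str(tuple(r)).encode()).hexdigest()
                for r in rows}
    dupes = row_hashes(train_df) & row_hashes(test_df)
    assert not dupes, f"{len(dupes)} duplicate rows across the split"
```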
