6.16 Honest evaluation: train/val/test discipline
We close with the practice that, more than any clever algorithmic insight, separates serious ML practitioners from the rest.
The single-use test set
Reserve a test set at the start of the project. Lock it away. Do not look at it. Do not use it to make any decisions. The first time you compute a metric on the test set is when you write the report or the paper.
This sounds easy but is difficult in practice, because human curiosity is strong and the cost of "just one peek" feels small. It is not. Each peek effectively shrinks the test set, because you are making development decisions conditioned on it. After enough peeks, the test set has been laundered into a second validation set, and no clean estimate of generalisation remains.
The defensive technique is to physically separate the test set from the development environment. Andrew Ng recommends storing it in a directory named "DO NOT TOUCH" and restricting access to a single team member. In Kaggle-style competitions, the test labels are hidden from participants entirely, and the private-leaderboard scores are revealed only at the end.
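A minimal sketch of the one-time split, assuming tabular data in a pandas DataFrame loaded from a hypothetical `all_data.csv`; the 80/10/10 ratios, directory names, and checksum step are illustrative, not prescriptions:

```python
# One-time split: run once at project start, then never touch test.csv again.
# The file names, 80/10/10 ratios, and directory layout are illustrative assumptions.
import hashlib
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("all_data.csv")  # hypothetical raw data file

dev, test = train_test_split(df, test_size=0.10, random_state=0)
train, val = train_test_split(dev, test_size=0.111, random_state=0)  # ~10% of the total

Path("data/DO_NOT_TOUCH").mkdir(parents=True, exist_ok=True)
train.to_csv("data/train.csv", index=False)
val.to_csv("data/val.csv", index=False)
test.to_csv("data/DO_NOT_TOUCH/test.csv", index=False)  # read again only for the final report

# Record a checksum so any later modification of the test set is detectable.
digest = hashlib.sha256(Path("data/DO_NOT_TOUCH/test.csv").read_bytes()).hexdigest()
Path("data/DO_NOT_TOUCH/sha256.txt").write_text(digest + "\n")
```

Fixing the random seed and writing the split to disk once means every later experiment reads the same files; the checksum makes an accidental (or curious) rewrite of the test file visible.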
Leaderboard chasing
When many teams compete on the same public leaderboard, every team's choices are influenced by their public score. Even if no individual team peeks at the test labels, the aggregate peek is enormous: hundreds of teams, thousands of submissions. The public leaderboard has been silently turned into a validation set. The private leaderboard, held out until the end, typically shows the top of the public leaderboard collapsing to mediocrity.
Dwork et al. (2015) studied this formally as the adaptive data analysis problem and proposed differentially private access to the holdout (the "reusable holdout") as a defence. In academic ML, the practical defences are to require submission of the code that produces the predictions, to limit submission frequency, and to validate winners on a fresh hold-out.
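The mechanism behind the differentially private defence can be sketched in a few lines. Below is a simplified version of the reusable-holdout (Thresholdout) idea from Dwork et al. (2015); the scoring callbacks, threshold, noise scale, and budget are assumptions chosen for illustration, not the paper's recommended settings:

```python
# Simplified sketch of the "reusable holdout" (Thresholdout) mechanism.
# Parameters below are illustrative only.
import numpy as np

class Thresholdout:
    def __init__(self, train_scores_fn, holdout_scores_fn,
                 threshold=0.02, noise_scale=0.01, budget=100, seed=0):
        self.train_scores_fn = train_scores_fn      # model -> per-example scores on train
        self.holdout_scores_fn = holdout_scores_fn  # model -> per-example scores on holdout
        self.threshold = threshold
        self.noise_scale = noise_scale
        self.budget = budget                        # number of "overfitting" answers allowed
        self.rng = np.random.default_rng(seed)

    def query(self, model):
        train_mean = float(np.mean(self.train_scores_fn(model)))
        holdout_mean = float(np.mean(self.holdout_scores_fn(model)))
        noisy_gap = abs(train_mean - holdout_mean) + self.rng.laplace(0, self.noise_scale)
        if noisy_gap < self.threshold:
            # Train and holdout agree: answer from the training set, holdout stays fresh.
            return train_mean
        # Disagreement: spend budget and return a noised holdout estimate.
        self.budget -= 1
        if self.budget < 0:
            raise RuntimeError("holdout budget exhausted")
        return holdout_mean + self.rng.laplace(0, self.noise_scale)
```

The holdout answers a query directly only when the training and holdout estimates disagree, so most adaptive submissions never really touch it; this is what lets a single holdout survive thousands of leaderboard queries.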
The benchmark contamination crisis
The broadest threat to honest evaluation in 2024–2026 is benchmark contamination. Modern foundation models are trained on web-scale data, and benchmark questions and answers are widely available on the web. Every well-known benchmark (MMLU, HumanEval, GSM8K, ARC, HellaSwag) appears in some form in the pre-training corpora of frontier models. The model has effectively been trained on the test; contamination-resistant evaluations such as LiveBench, LiveCodeBench, SWE-bench Verified, and Chatbot Arena have become the defaults for exactly this reason.
The consequence is that benchmark numbers are systematically inflated and not comparable across models. A 10% improvement on GSM8K may reflect 1% genuine capability and 9% increased contamination. The literature has responded with several countermeasures: (a) holding back fresh test sets that did not exist at the model's training cutoff (LiveBench and LiveCodeBench, for example, continually add questions written after recent training cutoffs), (b) per-instance contamination detection (asking the model to recall the question and checking the response for memorised continuations), and (c) shifting attention to dynamic benchmarks like Chatbot Arena, where humans pose novel queries.
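A crude but common per-instance check is n-gram overlap between benchmark items and the training corpus. The sketch below assumes both are available as lists of strings (`benchmark_items`, `corpus_docs`); the 13-gram window follows common practice, but the exact n and the whitespace tokenisation are assumptions:

```python
# Minimal n-gram overlap contamination check.  A benchmark item is flagged if it
# shares any n-gram with the (assumed accessible) training corpus.
import re

def ngrams(text, n=13):
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(benchmark_items, corpus_docs, n=13):
    """Return indices of benchmark items that share an n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & corpus_grams]
```

In practice the training corpus of a frontier model is rarely accessible, which is why recall-based probes of the model itself and post-cutoff test sets carry most of the weight.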
The take-away for any practitioner: a benchmark number is only as good as the evidence that the model has not seen that benchmark before. Always report contamination analysis if your model could plausibly have been exposed.
What "real progress" looks like
Real progress in machine learning is hard to see on a single dataset because random fluctuation, contamination, and reproducibility issues all conspire to make small improvements look real when they are not. The hallmarks of real progress are:
- Multiple benchmarks across modalities.
- Out-of-distribution evaluation on data the model has demonstrably not seen.
- Replication by independent groups.
- Confidence intervals reported with every score, computed by paired bootstrapping or by Nadeau–Bengio corrected $t$-tests (a paired-bootstrap sketch follows this list).
- Public release of code, models, and datasets.
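As promised above, here is a minimal paired-bootstrap sketch for the gap between two models scored on the same test examples; the per-example score arrays and the number of resamples are illustrative assumptions:

```python
# Paired bootstrap CI for the difference in mean score between two models
# evaluated on the same test examples (e.g. per-example 0/1 correctness).
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Return a (1 - alpha) CI for mean(scores_a) - mean(scores_b)."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample examples with replacement
        diffs[b] = scores_a[idx].mean() - scores_b[idx].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

Resampling the same example indices for both models is what makes the test paired: it preserves the per-example correlation between the two systems and gives a much tighter interval than treating the two score sets as independent.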
A 0.3% gain on a single benchmark that has been around for ten years, reported without confidence intervals by a single group, is noise. Treat it as such until proven otherwise. Conversely, a model that improves consistently across a dozen unrelated benchmarks, reported with intervals and replicated by others, is signal.