5.15 Statistics for AI Evaluation

Open any AI paper from the last five years and you will find a results table. Each cell holds a number: accuracy, F1, BLEU, win-rate, perplexity. The numbers are presented with confidence, often to three decimal places, and the difference between two competing systems is sometimes a fraction of a single percentage point. Headlines are written from those decimals. Funding decisions are made from them. Engineers choose which model to deploy on the basis of them.

The natural beginner's question is simply: are those numbers reliable? If model A scores 73.2 and model B scores 72.8, has A really beaten B, or could the gap have appeared by chance? If a fresh research group repeated the experiment with a different random seed and a different shuffle of the test set, would the ranking flip? Can a one-percent improvement on a benchmark ever be statistically meaningful, or is it noise dressed up as progress?

This section translates the everyday practice of AI evaluation into the statistical language built up earlier in the chapter. It is not a new branch of statistics. The tools we need, confidence intervals, paired hypothesis tests, the bootstrap, calibration measures, multiple-comparisons corrections, are exactly the tools introduced in §5.7 through §5.13. What changes is the setting. Instead of a clinical trial with a few hundred patients, we have a test set with ten thousand examples. Instead of two drugs, we compare two checkpoints. Instead of a single primary endpoint, we have dozens of benchmarks. The arithmetic transfers; the discipline must transfer with it.

A short bridge before we begin. Sections 5.7 to 5.9 gave us general-purpose machinery: how to put error bars on a single estimate, how to test whether two estimates differ, and how to use resampling when distributional assumptions fail. Section 5.13 added the language of model evaluation in the abstract, train, validation, test, cross-validation. This section is the meeting point. It applies that machinery to the concrete numbers reported in the AI literature, and it names the most common ways those numbers are misused.

Symbols Used Here
$\hat p$ : observed accuracy, the proportion of test examples answered correctly
$n$ : test-set size
$z$ : standard normal quantile, $\approx 1.96$ for 95% coverage
$\sigma$ : standard error of $\hat p$, here $\sqrt{\hat p(1-\hat p)/n}$

Confidence intervals on accuracy

When a model is evaluated on $n$ test examples and scored as right or wrong on each, the natural summary is the proportion correct, $\hat p = x/n$. The first instinct of a beginner is to treat $\hat p$ as a single, definitive number: 87 of 100 correct, accuracy 0.87, done. The statistical instinct is to ask how much that number would wobble if we drew a fresh test set from the same population. That wobble is a confidence interval.

The simplest CI for a proportion is the normal approximation: $\hat p \pm z\sqrt{\hat p (1-\hat p)/n}$, with $z \approx 1.96$ for 95% coverage. This works fine when $\hat p$ is near 0.5 and $n$ is large. It misbehaves badly when $\hat p$ is near 0 or 1, or when $n$ is small, because the symmetric interval can extend below zero or above one, which is impossible for a probability.

A better choice, and the one the AI evaluation community is slowly moving towards, is the Wilson score interval:

$$\frac{\hat p + z^2/(2n)}{1 + z^2/n} \pm \frac{z}{1 + z^2/n}\sqrt{\frac{\hat p(1-\hat p)}{n} + \frac{z^2}{4n^2}}.$$

It looks ugly, but conceptually it is just the normal interval with two corrections: the centre is pulled gently towards 0.5, and the half-width is adjusted so the interval stays inside $[0,1]$.

A worked example makes the difference vivid. Suppose a small benchmark has 100 questions and a model gets 10 right. Then $\hat p = 0.10$. The naive normal CI is $0.10 \pm 1.96 \sqrt{0.10 \times 0.90/100} \approx 0.10 \pm 0.059$, i.e. $[0.041, 0.159]$, and on a different test set with $\hat p = 0.02$, the same recipe would dip below zero. The Wilson interval for 10 out of 100 is approximately $[0.055, 0.174]$: still not symmetric around the point estimate, but properly bounded and slightly wider on the upside, where there is more room. For modest $n$ near a boundary, exactly the regime in which small evaluation sets and rare error categories live, always prefer Wilson.
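For readers who prefer code to algebra, both intervals fit in a few lines of Python. This is an illustrative sketch rather than a library API, and the function names are ours; statistics libraries such as statsmodels provide equivalent, well-tested routines.

```python
import math

def normal_ci(k, n, z=1.96):
    """Naive normal-approximation CI for a proportion k out of n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(k, n, z=1.96):
    """Wilson score interval; guaranteed to stay inside [0, 1]."""
    p = k / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half

print(normal_ci(10, 100))  # approx (0.041, 0.159)
print(wilson_ci(10, 100))  # approx (0.055, 0.174)
```

Note how the Wilson centre sits above the raw $\hat p = 0.10$: the interval is pulled gently towards 0.5, exactly as described above.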

Comparing two systems on the same test set

The single most important habit for beginners to absorb is this: when you compare two systems on the same examples, use a paired test. Treating the two accuracies as independent throws away the fact that examples have intrinsic difficulty, the easy ones are easy for both systems, the hard ones are hard for both. Pairing exploits that correlation and gives a much sharper test.

For binary classification, the standard paired test is McNemar's test. Build a 2 by 2 table with four cells: (A right, B right), (A right, B wrong), (A wrong, B right), and (A wrong, B wrong). The agreement cells, both right and both wrong, tell you nothing about which system is better. All the information lives in the disagreement cells: the cases where exactly one system got the answer right. McNemar's statistic is built from those two cells alone. If the disagreement is symmetric, A and B each occasionally beat the other on different examples, the systems are tied. If A beats B on many more examples than the reverse, A wins.

A small worked sketch. Two language models are tested on 1,000 multiple-choice questions. Model A is right on 820, model B on 840. Naively that is a 2 percentage-point gap. But the disagreement cells show 90 cases where A was right and B wrong, and 110 cases where B was right and A wrong. McNemar's test compares 90 to 110 against the null of equal probability, a small binomial calculation that gives a p-value around 0.18. The 2-point gap appears in the point estimates but is unconvincing on this test set: all the evidence lives in the 200 disagreements, and a 90-to-110 split among them is well within chance.
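The exact version of the test is a short binomial calculation. A sketch follows; the function name is ours, and it takes only the two disagreement counts.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test.

    b = examples system A got right and system B got wrong,
    c = examples B got right and A got wrong.
    Under the null, each disagreement falls to A or B with probability 1/2.
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(mcnemar_exact(90, 110))  # approx 0.18
```

Only the 200 disagreements enter the calculation; the 730 shared successes and 70 shared failures are irrelevant, which is why the apparent 2-point gap carries so little evidence.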

For metrics other than binary accuracy, F1, BLEU, ROUGE, exact-match, the paired bootstrap generalises the same idea. Resample the test examples with replacement, recompute both systems' metrics on each resample, and take the percentile interval of the difference. If the interval crosses zero, the systems are statistically tied. If it lies cleanly above zero, one is better. Ten thousand bootstrap resamples is a safe default; the calculation is embarrassingly parallel and rarely a bottleneck. Crucially, both systems must be evaluated on the same resampled subset on each iteration, that is what makes the bootstrap paired rather than independent.
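A sketch of the paired bootstrap in Python, assuming per-example scores (0/1 correctness, per-example exact-match, and so on) are already available for both systems; the function name is ours. For corpus-level metrics such as BLEU, the metric should be recomputed on each resampled set rather than averaged from per-example scores, but the pairing logic is identical.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, seed=0):
    """Percentile CI for the mean difference (A minus B).

    scores_a[i] and scores_b[i] must score the *same* test example.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample, used for both systems
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        diffs.append(mean_a - mean_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

The single index list `idx`, drawn once per iteration and applied to both systems, is what makes the procedure paired rather than independent.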

Multiple seeds and stochastic training

Modern neural networks are stochastic. Initialisation, data ordering, dropout masks, and even non-deterministic GPU kernels mean that training the same architecture on the same data with two different random seeds produces two different sets of weights and, consequently, two different test accuracies. Reporting a single number therefore answers the wrong question. The right question is: what is the distribution of test accuracies that this training procedure tends to produce?

Best practice is to train with at least five seeds (ten is better when compute allows) and report the mean and standard deviation, or a 95% confidence interval based on the seed-to-seed variability.

A worked example. A small classification model trained five times with different seeds gives test accuracies $\{72.1, 73.5, 71.8, 72.9, 73.2\}$. The mean is $72.7$ and the sample standard deviation (with the unbiased $n-1$ denominator) is approximately $0.72$. A 95% CI on the seed mean uses the t-distribution with four degrees of freedom: $72.7 \pm 2.78 \times 0.72/\sqrt{5} \approx 72.7 \pm 0.90$, giving roughly $[71.80, 73.60]$.
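The same arithmetic in a few lines of Python, using only the standard library. The t quantile for four degrees of freedom is hard-coded here; a statistics library such as SciPy would compute it for any number of seeds.

```python
from statistics import mean, stdev

accs = [72.1, 73.5, 71.8, 72.9, 73.2]         # test accuracy from five seeds
m, s, n = mean(accs), stdev(accs), len(accs)  # stdev uses the n-1 denominator
t = 2.776                                     # 97.5% quantile of t with 4 degrees of freedom
half = t * s / n ** 0.5
print(f"{m:.2f} +/- {half:.2f}")              # 72.70 +/- 0.90
```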

Now suppose a competing model, trained with the same protocol, has a seed-mean of 73.4. That number sits comfortably inside the first model's interval. The correct conclusion is that the two procedures produce indistinguishable performance on this task; the gap is within the noise floor of training itself. A paper that reports only one seed of each, say 73.5 versus 73.4, and headlines a 0.1-point improvement is not telling you about the models. It is telling you about the random seed.

The deeper point is that stochastic training has its own variance, separate from test-set variance. A confidence interval that reflects only test-set sampling underestimates the true uncertainty. Reporting both, multi-seed mean with seed CI, plus per-seed test-set CI, is the gold standard. Most published papers fall short of it.

Benchmark contamination

Pre-trained models are trained on enormous web crawls. Those crawls inevitably contain copies of public benchmarks: the questions from MMLU, the problems from GSM8K, the test items from TriviaQA. When a model has seen the test set during pretraining, its score on that test set measures memorisation, not generalisation.

Symptoms include suspiciously low entropy on test answers, near-perfect performance on items that should be hard, and a model that reproduces verbatim phrasing from the benchmark. Several mitigations exist. Held-out test sets created strictly after a model's training cutoff are the cleanest defence. Benchmarks built to resist contamination, such as FrontierMath, keep their items private or refresh them over time to stay ahead of the crawls. Private test sets, scored by a leaderboard rather than released, prevent leakage by construction. Post-cutoff data (fresh news, recent code, new exam papers) gives a contamination-free probe of generalisation.

For beginners, the takeaway is simple. A score on a public benchmark released before a model's training cutoff is a contaminated measurement until proven otherwise. Always ask: was this test set in the pretraining corpus? If the answer is yes, or unknown, the headline number is a memorisation probe, not a measure of generalisation.

Multiple comparisons in benchmarks

A frontier model is now routinely reported on dozens of benchmarks at once. If you run twenty independent tests at the 5% significance level, you expect about one false positive purely by chance. Picking the benchmark on which your new model wins, and headlining that benchmark, is statistical cheating dressed as research.

The classical fix is the Bonferroni correction: divide the significance threshold by the number of tests. If you compare on twenty benchmarks and want family-wise error of 5%, demand p < 0.0025 on each. Less conservative alternatives, Holm's step-down procedure, the Benjamini-Hochberg false discovery rate, are also fine, and often more powerful when many tests are genuinely positive.
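As a sketch, both corrections are only a few lines each. The function names are ours; libraries such as statsmodels provide tested implementations of these procedures and of Benjamini-Hochberg.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject hypothesis i when p_i < alpha / m; controls family-wise error at alpha."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: never less powerful than Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```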

The minimum reporting standard is to disclose all benchmarks tried, not just the winners, and to declare how many comparisons sit behind any p-value or claim of significance. Pre-registering the benchmark suite before running the experiments, the same discipline imposed on clinical trials, is the strongest safeguard, and it costs nothing but planning ahead.

Calibration and reliability

Accuracy says whether the model's top answer was right. Calibration asks whether the model's stated confidence matches reality. A well-calibrated model that says "I am 80% confident" should be right about 80% of the time when it makes such claims; a model that says 80% but is right 50% of the time is overconfident.

Three tools quantify calibration. A reliability diagram bins predictions by stated confidence and plots the empirical accuracy in each bin against the bin's confidence midpoint. A perfectly calibrated model lies on the diagonal. Expected calibration error (ECE) is the weighted average gap between confidence and accuracy across bins, a single scalar summary of the diagram. The Brier score combines accuracy and calibration into one number: the mean squared difference between predicted probability and the binary outcome.
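Both scalar summaries are straightforward to compute from a list of per-prediction confidences and 0/1 outcomes. A sketch follows, with function names of our own choosing and ten equal-width bins for the ECE.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated probability and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE with equal-width confidence bins, weighted by bin occupancy."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        b = min(int(c * n_bins), n_bins - 1)  # confidence of exactly 1.0 goes in the top bin
        bins[b].append((c, o))
    n = len(outcomes)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            ece += len(bucket) / n * abs(avg_conf - accuracy)
    return ece
```

Because the ECE depends on the binning scheme, report the number of bins alongside the score.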

Why this matters in practice. A medical AI that is 90% confident and wrong is far more dangerous than one that is 60% confident and wrong, because downstream decisions are taken from the confidence. Modern frontier models tend to be overconfident on the hardest items, exactly where overconfidence costs most. Reporting calibration alongside accuracy is increasingly expected, and selective prediction (refusing to answer below a confidence threshold) is a real safety knob that only works if the probabilities are well calibrated.

Beginners should read every accuracy claim with the question "and how confident was the model?" attached.

Robustness evaluations

Standard test accuracy measures performance on examples drawn from the same distribution as the training data. The real world is rarely so kind. Robustness evaluations probe what happens when the input shifts: a small adversarial perturbation, a different population, a domain the model was not trained on, an image taken in unusual lighting.

Three categories of robustness test recur in modern AI papers. Adversarial accuracy measures performance under worst-case perturbations crafted by an attacker; the gap between clean and adversarial accuracy is often dramatic. Distribution-shift evaluations test on a held-out source, a different hospital, a different demographic, a different decade of news, to see whether learned features generalise. Out-of-distribution detection asks whether the model can flag inputs that fall outside its training distribution at all, rather than confidently producing nonsense.

Each of these needs its own held-out set, often constructed adversarially, and each requires its own confidence intervals and paired comparisons. A model that scores 95% on the standard test set and 30% under modest distribution shift is not a 95% model in any honest sense. The numbers reported in deployment-relevant settings are almost always worse than the numbers on the headline benchmark, sometimes by an order of magnitude.

The lesson for beginners: standard test accuracy alone tells you very little about real-world reliability. Always look for the robustness columns of the table, and treat any system without them as unevaluated for deployment.

What you should take away

  1. Always attach uncertainty to every reported number. Use Wilson intervals for proportions, the bootstrap for complex metrics, and seed-to-seed CIs for stochastic training.
  2. Prefer paired tests when comparing two systems on the same data. McNemar for binary outcomes, paired bootstrap for everything else. Pairing recovers the power that independence assumptions throw away.
  3. Train with multiple seeds and report the spread. A single seed is an anecdote, not a measurement.
  4. Treat any benchmark released before the model's cutoff as potentially contaminated, and correct for the number of benchmarks compared when claiming significance.
  5. Look beyond accuracy. Calibration, robustness under distribution shift, and adversarial behaviour are part of evaluation, not optional extras.
