5.13 Model Evaluation and Selection
Building a model is the easy part. You pick an architecture, choose a loss, hit train, and after a while you have something that produces predictions. The hard part is what comes next. Is this model any good? Is it better than the simpler one you tried last week? Will the accuracy you measured this afternoon survive contact with next month's data, or will it evaporate the moment your model meets users it has never seen?
These questions are the heart of model evaluation and selection, and they are where most of the careful statistical thinking in a machine-learning project lives. A confident leaderboard number can vanish in production for a dozen reasons: the test set was contaminated, the validation protocol was lax, hyperparameters were tuned against the wrong split, or the comparison between two models never accounted for the variability of the comparison itself. The mathematics in this section is light. The discipline is heavy.
In §5.12 we built hierarchical priors that let information flow between groups. That was about the structure of a model. This section is about the judgement of a model: cross-validation to estimate how it will generalise, information criteria to compare candidates, hold-out sets to give an unbiased final number, and the pitfalls that quietly inflate your estimates if you are not careful. You will use these ideas continually from Chapter 6 onwards, where every learning algorithm we introduce comes with the same question attached: how do we know it works?
Train/validation/test split
The first rule of model evaluation is that you must not use the same data to fit a model and to judge it. A model fit to a dataset has, in a real sense, memorised parts of that dataset. Asking it how well it does on those very examples is asking a student to grade their own homework using a copy of the answers. The score you get is not an estimate of how well the model will do on new data; it is an estimate of how well the model has learned to repeat what you showed it.
The standard remedy is a three-way split. A typical division is 70% training, 15% validation, 15% test, though for very large datasets you can get away with much smaller validation and test fractions (a 90/5/5 split works fine when 5% is still tens of thousands of examples). The roles of the three sets are different and must be kept separate; the mechanics of the split itself are sketched in code after the three roles below.
The training set is what the optimiser sees. Gradients are computed against it, weights are updated against it, and the model fits its parameters to it.
The validation set is for choices about the model that are not made by the optimiser: the learning rate, the depth of the network, the regularisation strength, the choice between two architectures. You fit a model on training, score it on validation, change something, repeat. The validation set is the dial you turn to tune your design.
The test set is for the single, final, unbiased estimate of how the chosen model will perform in the wild. You touch it once.
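A minimal sketch of the split mechanics, assuming scikit-learn and toy data (the two-stage call and the exact fractions are illustrative choices, not the only correct ones):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))    # toy features
y = rng.integers(0, 2, size=1000)  # toy binary labels

# Carve off 15% as the untouchable test set first...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y)

# ...then split the remainder so that validation is 15% of the original.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=0, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))  # roughly 700, 150, 150
```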
"You touch it once" is the rule that beginners break most often. The moment you look at the test set during model development, even briefly, its statistical guarantees are gone. Suppose you train ten models, evaluate each on the test set, and pick the best. The number you report is no longer an unbiased estimate of generalisation. It is the maximum of ten noisy estimates, which is biased upward. You have, by stealth, used the test set as a validation set. Pre-registering your final model before the test set is touched, and reporting the very first number you see, is the only way to keep the guarantee intact.
k-fold cross-validation
A single train/validation split has an obvious weakness: the validation score depends on which examples happened to land in validation. With small datasets, this variability is large, and a single split can be misleading. k-fold cross-validation averages over the choice.
Partition the data into $K$ equal-sized folds. For each $k = 1, \ldots, K$, train on the other $K - 1$ folds and evaluate on fold $k$. You now have $K$ validation scores, one per fold; their mean is the cross-validation estimate of generalisation error, and their standard deviation tells you how confident you should be in that estimate. The most common choices are $K = 5$ or $K = 10$, which strike a balance between statistical efficiency and computational cost.
The advantages are real. Every example is used for training in $K-1$ of the folds and for validation in exactly one, so no data is wasted. The averaging reduces the variance of the estimate. And the spread across folds gives you a free standard error.
The disadvantages are also real. You train $K$ models instead of one, so the compute bill is multiplied. For large datasets (say, ImageNet-scale problems where one training run already takes days), a single large validation set is usually enough and cross-validation is impractical. Cross-validation is most useful in the data-poor regime, which is most of clinical statistics, much of scientific research, and a surprising amount of small-business data work.
A worked example fixes the idea. Suppose you have 1000 examples and you choose $K = 5$. You partition them into five folds of 200 examples each. In round 1, you train on folds 2–5 (800 examples) and evaluate on fold 1 (200 examples), producing a validation loss $\ell_1$. In round 2, you train on folds 1, 3, 4, 5 and evaluate on fold 2, producing $\ell_2$. After all five rounds, your CV estimate of generalisation loss is $(\ell_1 + \ell_2 + \ell_3 + \ell_4 + \ell_5)/5$, with a standard error of roughly $\text{sd}(\ell_k) / \sqrt{5}$.
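The same computation in code, as a sketch: scikit-learn's `KFold` and `cross_val_score`, with a logistic regression standing in for whatever model you are actually evaluating.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Five folds of 200; each model trains on 800 examples and is scored on 200.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)                                     # one accuracy per fold
print(scores.mean())                              # the CV estimate
print(scores.std(ddof=1) / np.sqrt(len(scores)))  # rough standard error
```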
Two refinements matter in practice. Stratified k-fold ensures each fold has the same proportion of each class as the full dataset, which is essential for imbalanced classification. Group k-fold keeps related observations together (all of one patient's records, all measurements from one site, all frames from one video) so that the validation fold cannot leak information about the training folds. For time series, you should never train on the future and validate on the past; use forward-chaining splits instead.
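All three refinements ship as ready-made splitters in scikit-learn. A sketch, where `patient_ids` is a hypothetical grouping variable:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
patient_ids = rng.integers(0, 20, size=100)  # hypothetical grouping variable

# Stratified: every fold preserves the class proportions of y.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    pass  # fit on X[train_idx], score on X[val_idx]

# Grouped: all rows sharing a patient id land in the same fold.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=patient_ids):
    pass

# Time series: each split trains on an initial segment, validates on what follows.
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    pass
```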
Leave-one-out (LOOCV)
Pushing $K$ to its extreme gives leave-one-out cross-validation, where $K = n$. You train on $n - 1$ examples, validate on the remaining one, and rotate through all $n$ choices. The validation set is a single example each time, so $n$ models are trained in total.
LOOCV has near-zero bias as an estimate of the generalisation error of a model trained on $n$ examples (at each round the training set has $n - 1$ examples, almost the full size), but it has higher variance than $K = 5$ or $10$. The intuition is that the $n$ models are highly correlated with each other, since any two of them differ in only two examples. Averaging correlated estimates does less to reduce variance than averaging independent ones.
LOOCV is also expensive. Training $n$ models is fine when $n = 50$ and your model fits in a second; it is a non-starter when $n = 100{,}000$ and each fit takes an hour. Its niche is small-data regimes where every example is precious: clinical studies with a few dozen patients, ecological field studies, costly laboratory experiments. There you are willing to pay the compute cost in exchange for the most efficient use of every observation.
For a few model classes, most notably linear regression with a quadratic loss, LOOCV has a closed-form shortcut and can be computed from a single fit using the so-called PRESS statistic. That is a happy exception; for most modern models you really do have to refit $n$ times, and the price quickly becomes prohibitive.
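For ordinary least squares the shortcut is concrete: with hat matrix $H = X(X^\top X)^{-1}X^\top$ and residuals $e_i$, the leave-one-out residuals are $e_i / (1 - h_{ii})$, so PRESS follows from a single fit. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Leverages h_ii: the diagonal of the hat matrix H = X (X'X)^{-1} X'.
h = np.sum(X * np.linalg.solve(X.T @ X, X.T).T, axis=1)

# PRESS: the sum of squared leave-one-out residuals e_i / (1 - h_ii),
# obtained from one fit instead of n refits.
press = np.sum((residuals / (1.0 - h)) ** 2)
print(press / n)  # the LOOCV estimate of mean squared error
```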
A small clinical example illustrates the trade-off. Suppose you have 40 patients, each with a vector of biomarkers and a binary outcome (responds to treatment, does not). A 5-fold split leaves only 8 patients per validation fold, and the validation accuracy estimate from a single split is so noisy as to be meaningless. With LOOCV, every patient is held out once, you train 40 models, and the resulting estimate of accuracy, though imperfect, is the best you can extract from the data you have. The computational cost is modest because each model is small, and the alternative (a single 80/20 split) would simply throw information away.
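A sketch of that protocol, assuming scikit-learn and synthetic stand-ins for the real biomarkers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))     # 40 patients, 6 biomarkers (synthetic)
y = rng.integers(0, 2, size=40)  # binary outcome (synthetic)

# 40 models: each trains on 39 patients and predicts the held-out one.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(scores.mean())  # fraction of held-out patients predicted correctly
```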
Information criteria
Cross-validation is empirical: you actually fit and evaluate. Information criteria offer a cheaper, theoretical alternative for likelihood-based models. Each is a single number computed from one fit, balancing how well the model fits against how complex it is.
The Akaike Information Criterion is
$$\operatorname{AIC} = -2 \hat\ell + 2p,$$
where $\hat\ell$ is the maximised log-likelihood of the fitted model and $p$ is the number of free parameters. The first term rewards good fit; the second penalises complexity. Lower AIC is better. AIC is derived as an asymptotic estimate of the Kullback–Leibler divergence between the model and the unknown true data-generating process, so it is selecting for predictive performance.
The Bayesian Information Criterion is
$$\operatorname{BIC} = -2 \hat\ell + p \log n,$$
where $n$ is the sample size. The penalty is heavier than AIC's for any $n > 7$ (because $\log n > 2$), and it grows with the dataset. Lower BIC is better. BIC arises as an approximation to the negative log marginal likelihood under a Bayesian model, and it is consistent for model identification: given enough data and a true model in the candidate set, BIC will pick it with probability approaching one.
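Both criteria are one-liners once you have the maximised log-likelihood. A sketch comparing polynomial orders under a Gaussian noise model, where the data and the candidate set are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=n)  # true order: 2

for degree in range(1, 6):
    X = np.vander(x, degree + 1)                 # polynomial design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n                             # ML estimate of noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2                               # coefficients plus the variance
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + k * np.log(n)
    print(degree, aic, bic)  # both should bottom out at degree 2
```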
The two criteria answer subtly different questions and routinely disagree. AIC tends to favour slightly larger models because it is hunting for predictive accuracy; BIC tends to favour smaller, simpler models because it is hunting for the true structure. Neither is universally right. In modern deep learning, both are largely uninformative: likelihoods are intractable for many architectures, parameter counts vastly exceed sample sizes, and the asymptotic approximations break down. Practitioners therefore fall back on cross-validation and held-out validation. For traditional statistical models with well-defined likelihoods, AIC and BIC remain useful and quick.
The optimism of training error
Training error is biased downward as an estimate of generalisation error. The model has seen the training data, optimised against it, and possibly memorised parts of it. Reporting training accuracy as if it were a measure of how the model will perform in deployment is one of the most common mistakes in applied machine learning.
The gap between training error and validation or test error is the generalisation gap. A small gap means the model has learned something general; a large gap means it has overfit, capturing patterns that exist only in the training set.
Three factors enlarge the gap. Larger models have more parameters available to memorise idiosyncrasies. Smaller datasets offer less signal and more noise to fit. Less regularisation (weaker weight decay, no dropout, no early stopping) gives the optimiser more freedom to chase training loss into territory that does not generalise.
A canonical worked example: a deep convolutional network with 100 million parameters trained on 10,000 images may achieve 100% training accuracy and 60% test accuracy. The model has learned the training set perfectly, including its quirks; on new images, it falls apart. The 40-point gap is overfitting in concentrated form. The remedy is some combination of more data, a smaller or more regularised model, and stronger evaluation discipline.
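The same phenomenon in miniature, assuming nothing beyond numpy: a high-degree polynomial fit to 20 noisy points drives training error toward zero while test error grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + rng.normal(scale=0.2, size=n)

x_tr, y_tr = sample(20)    # small training set
x_te, y_te = sample(500)   # large test set from the same distribution

for degree in (3, 12):
    coef = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(degree, mse_tr, mse_te)
# The degree-12 fit pushes training MSE far below the noise floor while
# test MSE rises: the difference between the columns is the generalisation gap.
```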
The deeper point is that training error tells you almost nothing on its own. It only becomes informative when paired with a validation or test error from data the model has not seen. A model with low training error and high validation error has overfit. A model with high training error has underfit: it could not even learn the training data, so something is broken in the architecture, the optimisation, or the labels.
Comparing models with significance tests
You have two models and want to know whether one is genuinely better than the other or whether the difference is noise. Treating their test scores as two independent numbers is too coarse, because both were evaluated on the same test set. The right tool is a paired test, in which each test example contributes a paired observation: one outcome from each model.
For binary classification on the same examples, McNemar's test is the standard choice. It compares the off-diagonal cells of a 2-by-2 table: the number of examples model A got right and model B got wrong, against the number where the situation is reversed. The test asks whether these disagreements are roughly balanced (the models are equivalent) or skewed (one model is genuinely better).
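A sketch using `statsmodels`; the prediction arrays here are synthetic placeholders for your two models' outputs on a shared test set:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)  # placeholder test labels
# Placeholder predictions: model A right ~85% of the time, model B ~80%.
pred_a = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)
pred_b = np.where(rng.random(500) < 0.80, y_true, 1 - y_true)

a_right = pred_a == y_true
b_right = pred_b == y_true
# Only the off-diagonal cells (the disagreements) drive the test.
table = [[np.sum(a_right & b_right),  np.sum(a_right & ~b_right)],
         [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)]]

result = mcnemar(table, exact=True)  # exact binomial test on the disagreements
print(result.statistic, result.pvalue)
```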
For continuous metrics, the paired bootstrap gives a confidence interval on the difference in scores. Resample the test set with replacement, recompute both models' metrics on the resample, take the difference, and repeat thousands of times. The 2.5th and 97.5th percentiles of the resulting distribution give a 95% CI on the difference. If the CI excludes zero, the difference is statistically meaningful at the 5% level.
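The procedure fits in a dozen lines; a sketch, assuming two arrays of per-example scores (losses, squared errors, or 0/1 correctness indicators) computed on the same test examples:

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, seed=0):
    """95% CI for mean(scores_a) - mean(scores_b), resampling examples jointly."""
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same resample applied to both models
        diffs[b] = scores_a[idx].mean() - scores_b[idx].mean()
    return np.percentile(diffs, [2.5, 97.5])

# Placeholder per-example correctness for two models on a shared test set.
rng = np.random.default_rng(1)
a = (rng.random(500) < 0.85).astype(float)
b = (rng.random(500) < 0.80).astype(float)
print(paired_bootstrap_ci(a, b))  # a CI excluding 0 suggests a real difference
```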
A subtler trap is selection on the maximum. If you compare $K$ candidate models on a single test set and report the best one, you have implicitly used the test set to choose between $K$ alternatives, and the maximum of $K$ noisy estimates is biased upward. The bias grows with $K$. Adjust by holding out a separate final test set, by reporting all $K$ scores instead of just the maximum, or by applying a multiple-comparisons correction to the significance test. Reporting "we tried 50 architectures and the best one beat the baseline" without disclosing the 50 is the methodological equivalent of cherry-picking.
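The size of that bias is easy to simulate. A sketch, assuming $K$ equally good models whose test scores differ only through evaluation noise:

```python
import numpy as np

rng = np.random.default_rng(0)
true_acc, n_test, K, n_sim = 0.80, 1000, 50, 2000

best = np.empty(n_sim)
for s in range(n_sim):
    # K models with identical true accuracy; each gets a noisy test estimate.
    scores = rng.binomial(n_test, true_acc, size=K) / n_test
    best[s] = scores.max()

# The reported "best" accuracy overstates the truth by roughly 3 points here.
print(best.mean() - true_acc)
```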
Practical workflow
A defensible end-to-end workflow looks like this.
Train on the training set; make every development decision on the validation set. Tune hyperparameters, compare architectures, decide what to deploy. Iterate freely. The validation set will get used many times, and any final number on it is contaminated by the search.
Once your final model is chosen, evaluate on the held-out test set exactly once. Run inference, compute metrics, write the number down. Do not now go back, change something, and re-evaluate. If you do, your test set has just become another validation set.
Report test performance with a confidence interval. A point estimate without a CI hides whether the result is plausibly attributable to noise. Bootstrap the test set or compute an analytical CI appropriate to the metric.
If you don't trust your test set, get a fresh one. This matters more than people realise. Standard test sets (ImageNet's, CIFAR-10's, GLUE's) have been evaluated on so many times in published research that they have effectively become extended validation sets for the field. If your test set may have leaked into training data, may have been seen during model development, or may simply be too easy to be informative, find a new one. Real-world deployments often need a prospective test set: data collected after the model was finalised, so contamination is impossible by construction.
The discipline behind these four steps is what separates published numbers that hold up from published numbers that quietly fail to replicate. The mathematics is not the hard part. The honesty is.
What you should take away
Three sets, three roles. Train fits, validation tunes, test reports. Touch the test set once and only once.
Cross-validation is for small data. $K = 5$ or $K = 10$ gives a low-variance estimate of generalisation when one held-out set would be too small or too noisy. LOOCV is for very small data only.
AIC and BIC are quick comparators for likelihood-based models. $\operatorname{AIC} = -2\hat\ell + 2p$ targets predictive performance; $\operatorname{BIC} = -2\hat\ell + p \log n$ targets identification of the true model. They disagree for principled reasons.
Training error is optimistic; the generalisation gap is the diagnosis. If train is good and test is poor, you have overfit; if both are poor, you have underfit. Never report training accuracy alone.
Paired comparisons, confidence intervals, no peeking. When comparing models on a shared test set, use paired tests and bootstrap CIs. When choosing the best of many candidates, account for the maximum's upward bias. The leaderboard numbers that survive deployment are the ones that came out of disciplined protocols, not lucky seeds.