6.14 Common evaluation metrics
When a beginner first trains a machine-learning model, the very first question they ask is the most natural one in the world: how good is it? The instinct is to look at accuracy on the test set, see a high number, and feel satisfied. The instinct is wrong. A model's score on the test set is a single number summarising thousands of predictions, and the question of which number to compute is far from neutral. Different metrics measure different things, and they can disagree wildly. A spam filter that gets 99 percent of emails right may still let through every single piece of phishing. A medical-diagnosis model that catches every case of disease may also flag thousands of healthy people. A translation system that picks the most likely word at every step may produce fluent nonsense. The right metric depends on the task: regression error for continuous outputs, F1 for imbalanced classification, ROC-AUC for threshold-free scoring, BLEU for translation, perplexity for language models, NDCG for search and recommendation. This section catalogues the standard metrics with worked examples, so that you know not just how to compute each one but when it is the right thing to ask for.
§6.16 below pairs with this section: it covers train/val/test discipline, the workflow context in which the metrics here are computed. This section teaches you how to measure performance correctly once those splits are in place.
Regression metrics
When the thing you are predicting is a continuous number (the price of a house, tomorrow's temperature, the time a patient will spend in hospital), you are doing regression, and your metric measures how far off your predictions are from the true values. There is no single right answer. The correct choice depends on whether large errors should be punished more harshly than small ones, whether you care about the units of the answer, and whether your data contains outliers that one extreme observation could distort.
The default starting point is the mean squared error, defined as $$ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat y_i)^2. $$ Squaring each error before averaging means a prediction that is off by ten units contributes a hundred times more than a prediction off by one. This is desirable when large errors are genuinely bad: a weather model that occasionally forecasts snow on a hot summer day deserves to be punished severely. It is undesirable when your data has wild outliers that pull the average towards them; a single mistake can dominate the score. MSE is also in squared units, which is rarely what a human reader wants.
The mean absolute error, $\text{MAE} = \frac{1}{n}\sum_i |y_i - \hat y_i|$, dispenses with the square. Every error contributes in proportion to its size, no more and no less. MAE is more robust to outliers and easier to interpret: an MAE of three pounds means the typical prediction is off by about three pounds. Its mathematical optimum is the conditional median of the target, whereas MSE's optimum is the conditional mean, a small but sometimes consequential difference.
The root mean squared error, $\text{RMSE} = \sqrt{\text{MSE}}$, is MSE's gentler cousin. By taking the square root, you put the metric back into the original units of the target. It still penalises large errors more than small ones, but it can be reported as "off by about £15,000 on average" rather than "off by 225 million pounds-squared".
The mean absolute percentage error, $\text{MAPE} = \frac{100}{n}\sum_i |y_i - \hat y_i|/|y_i|$, expresses error as a percentage. It is scale-independent, which makes it useful for comparing models across datasets of different magnitudes, but it explodes when any true value is near zero and is asymmetric: because the denominator is the true value, a forecast of 150 when the truth is 100 scores 50 percent, while a forecast of 100 when the truth is 150 scores only about 33 percent, even though both are off by the same amount.
Finally, R-squared, $R^2 = 1 - \text{RSS}/\text{TSS}$, expresses the fraction of variance the model explains relative to a baseline of always predicting the mean. R² of 1 is perfection; R² of 0 means your model does no better than a constant; R² below zero means it does worse, which is genuinely bad news.
A worked example fixes the ideas. Suppose your three predictions are $(2.1, 0.5, -0.3)$ and the true values are $(2, 1, 0)$. The errors are $0.1$, $-0.5$, and $-0.3$. Squaring gives $0.01, 0.25, 0.09$; averaging gives MSE $\approx 0.117$. Taking absolute values gives $0.1, 0.5, 0.3$; averaging gives MAE $= 0.3$. The mean of the truth is $1$, so the total sum of squares is $(2{-}1)^2 + (1{-}1)^2 + (0{-}1)^2 = 2$; the residual sum of squares is $0.35$; so $R^2 = 1 - 0.35/2 \approx 0.825$. The model is not perfect, but it captures most of the variation.
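If you want to check these numbers yourself, a few lines of NumPy reproduce them. This is a minimal sketch assuming NumPy is available; scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute the same quantities from the same arrays.

```python
import numpy as np

y_true = np.array([2.0, 1.0, 0.0])
y_pred = np.array([2.1, 0.5, -0.3])

errors = y_pred - y_true                        # 0.1, -0.5, -0.3
mse  = np.mean(errors ** 2)                     # ~0.117
rmse = np.sqrt(mse)                             # ~0.342, back in the original units
mae  = np.mean(np.abs(errors))                  # 0.3
rss  = np.sum(errors ** 2)                      # 0.35
tss  = np.sum((y_true - y_true.mean()) ** 2)    # 2.0
r2   = 1 - rss / tss                            # ~0.825

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```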
Confusion matrix and basic classification metrics
For classification, predicting which of several discrete categories an example belongs to, the picture changes entirely. There is no notion of "off by 0.3"; either the prediction is correct, or it is not. To analyse the kinds of mistakes a model makes, we use the confusion matrix, which tabulates predictions against truths. For binary classification (positive vs negative, spam vs ham, disease vs healthy), the matrix has four cells:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
A true positive (TP) is a real positive correctly flagged. A false negative (FN) is a real positive missed. A false positive (FP) is a negative incorrectly flagged. A true negative (TN) is a negative correctly cleared. Almost every classification metric is a different way of squeezing these four numbers into a single score.
The most familiar metric is accuracy, $(\text{TP} + \text{TN})/n$, the fraction of predictions that match the truth. It is intuitive, easily reported, and frequently disastrous. The trouble is that accuracy treats every example identically, which fails badly when one class is much rarer than the other. Imagine a disease that affects one person in a thousand. A model that predicts "no disease" for every patient achieves 99.9 percent accuracy and is utterly useless. Whenever the classes are imbalanced, and in real applications they nearly always are, accuracy alone will mislead you.
Precision answers a different question: of the cases the model labelled positive, what fraction really were positive? Formally, $\text{Precision} = \text{TP}/(\text{TP} + \text{FP})$. High precision means that when the model says "yes", you can trust it. A precision of 0.9 means nine out of every ten alarms are real, and one is a false alarm. Precision is what matters when false alarms are expensive: pulling someone out of a queue at the airport, sending them for invasive surgery, freezing their bank account.
Recall, also called sensitivity or the true-positive rate, asks the opposite question: of all the real positives in the world, what fraction did the model catch? $\text{Recall} = \text{TP}/(\text{TP} + \text{FN})$. High recall means few real cases slip past unnoticed. A recall of 0.8 means we catch eight out of every ten genuine positives, but miss two. Recall is what matters when missed cases are dangerous: cancer screening, intrusion detection, child-safety filters.
Precision and recall pull in opposite directions. To improve precision, the easy trick is to predict "yes" only when you are very confident, at the cost of missing borderline real cases (lower recall). To improve recall, the easy trick is to predict "yes" generously, at the cost of more false alarms (lower precision). The F1 score, $F_1 = 2PR/(P+R)$, takes the harmonic mean of the two and so penalises any model that performs poorly on either. Unlike a simple average, the harmonic mean drags strongly toward whichever number is smaller; a precision of 1.0 paired with a recall of 0.1 gives only $F_1 \approx 0.18$, not the misleading 0.55 that a plain mean would suggest. This makes F1 a fair single-number summary when both kinds of error matter.
Specificity, $\text{TN}/(\text{TN} + \text{FP})$, mirrors recall on the negative class. High specificity means the model rarely raises a false alarm against a healthy negative. Sensitivity and specificity together are the standard pair in clinical medicine, where false negatives and false positives have very different costs.
A worked example. You have 1000 emails. Of them, 100 are spam (the positive class) and 900 are legitimate. Your model flags 90 emails as spam. Of those 90, eighty are real spam (TP $=80$) and ten are wrongly accused (FP $=10$). Among the 910 emails the model passed through, 890 were genuinely fine (TN $=890$) and 20 were spam that slipped through (FN $=20$). Plug in the formulas. Precision is $80/(80+10) = 80/90 \approx 0.889$. Recall is $80/(80+20) = 80/100 = 0.8$. The F1 score is $2 \times 0.889 \times 0.8 / (0.889 + 0.8) \approx 1.422 / 1.689 \approx 0.842$. Accuracy, by contrast, is $(80 + 890)/1000 = 0.97$, a much rosier number that hides the missed spam entirely.
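The same arithmetic in code, as a minimal sketch that starts from the four counts above; with raw label arrays you would normally reach for scikit-learn's `precision_score`, `recall_score`, and `f1_score` instead.

```python
# Confusion-matrix counts from the spam example above.
tp, fp, fn, tn = 80, 10, 20, 890

accuracy    = (tp + tn) / (tp + tn + fp + fn)                 # 0.97
precision   = tp / (tp + fp)                                  # ~0.889
recall      = tp / (tp + fn)                                  # 0.8
specificity = tn / (tn + fp)                                  # ~0.989
f1          = 2 * precision * recall / (precision + recall)   # ~0.842

print(f"acc={accuracy:.3f} P={precision:.3f} R={recall:.3f} "
      f"spec={specificity:.3f} F1={f1:.3f}")
```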
ROC and AUC
So far we have assumed the model returns a hard yes-or-no prediction. In reality, most classifiers output a continuous score, a probability between 0 and 1, say, and we choose a threshold above which to predict "positive". A different threshold gives a different confusion matrix. A natural question is therefore: how good is the model across all possible thresholds?
The receiver operating characteristic curve, or ROC curve, answers this by plotting the true-positive rate (recall) against the false-positive rate ($\text{FP}/(\text{FP}+\text{TN})$) as the threshold sweeps from 0 to 1. At a strict threshold, both rates are low; at a lenient threshold, both rates are high; the curve traces the trade-off. A perfect classifier hugs the top-left corner. A useless classifier, one that ranks cases no better than a coin flip, produces a diagonal straight line.
The area under the ROC curve (AUC) collapses the whole picture into a single number between 0 and 1. AUC of 0.5 means random; 1.0 means perfect ranking. There is a clean probabilistic interpretation: AUC equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. AUC is therefore a measure of ranking quality, independent of any specific threshold. It is the right metric when the user of the model will set their own threshold based on their own costs, for instance, a fraud team that wants to investigate the top 1 percent most suspicious transactions.
AUC has a well-known weakness. When positives are extremely rare, the false-positive rate denominator is dominated by the enormous negative class, so even mediocre models look impressive. In that regime, the precision–recall curve, which plots precision on the y-axis against recall on the x-axis as the threshold varies, is more informative, because both axes are sensitive to the rare positive class. Its summary scalar is AUC-PR, the area under the precision–recall curve. For a fraud detector with one positive in a thousand, AUC-PR will reveal weakness that AUC-ROC happily hides.
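In practice the curves and areas come from a library rather than by hand. A small sketch, assuming scikit-learn is installed; the data are synthetic and heavily imbalanced precisely to show ROC-AUC looking healthier than PR-AUC (`average_precision_score` is the usual estimate of the area under the precision–recall curve).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Toy imbalanced problem: roughly 1 positive in 50, scores only mildly informative.
y_true  = rng.binomial(1, 0.02, size=5000)
y_score = 0.3 * y_true + rng.normal(0, 0.3, size=5000)

roc_auc = roc_auc_score(y_true, y_score)             # ranking quality, threshold-free
pr_auc  = average_precision_score(y_true, y_score)   # area under the PR curve

print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")  # PR-AUC is typically far lower
```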
Multi-class metrics
When there are more than two classes, recognising digits 0–9, for example, or sorting documents into ten topics, the precision/recall/F1 trio still applies, but you must compute it per class and then aggregate. Three conventions dominate.
Macro-averaged F1 computes F1 separately for each class and then takes the plain mean. Every class contributes equally, regardless of how many examples it has. This is the right choice when minority classes matter as much as majority ones, for instance, recognising rare diseases.
Micro-averaged F1 pools the TP, FP, and FN counts across all classes and computes a single F1 from the totals. Frequent classes dominate the result, because they contribute more counts to the pool. This is appropriate when overall accuracy is what you really care about and minority classes are unimportant.
Weighted F1 is a compromise: per-class F1 averaged with weights proportional to class size. It reports something close to micro-F1 for accuracy-dominated tasks but preserves the per-class structure.
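To see how the three averages can disagree, here is a small sketch assuming scikit-learn; the label lists are invented, with a rare third class that the model never predicts.

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 2 is rare and the model never predicts it.
y_true = [0] * 50 + [1] * 45 + [2] * 5
y_pred = [0] * 48 + [1] * 2 + [1] * 45 + [0] * 5

for avg in ("macro", "micro", "weighted"):
    score = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8} F1 = {score:.3f}")
# macro is dragged down by the missed rare class; micro is dominated by the
# frequent classes (and equals plain accuracy here); weighted sits in between.
```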
A different, very practical metric for multi-class problems is top-k accuracy: the prediction counts as correct if the true class is among the top $k$ scored classes. Top-1 accuracy is normal accuracy; top-5 accuracy is the standard reporting metric on ImageNet, where the difference between a Norfolk terrier and a Norwich terrier should not bring a system to its knees. Top-k is also natural for recommendation: was the right film among the five we suggested?
Language model metrics
Language models present new difficulties. There is no "right answer" for an open-ended generation; many translations are equally valid; many summaries are equally accurate. Several metrics have evolved to cope.
Perplexity measures how surprised the model is by held-out text. Formally, $\text{perplexity} = \exp(\text{cross-entropy loss})$. Lower is better. A perplexity of 50 means the model is, on average, as uncertain as if it were choosing uniformly from 50 possible next tokens at each step. Perplexity is the natural intrinsic metric for language modelling itself, but it tells you nothing about whether the generations are actually any good.
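A sketch of the relationship, assuming you already have the model's natural-log probabilities for each actual next token in a held-out text; the values below are invented.

```python
import numpy as np

# Natural-log probabilities the model assigned to each actual next token.
log_probs = np.array([-2.3, -0.7, -4.1, -1.2, -3.0])   # hypothetical values

cross_entropy = -log_probs.mean()    # average negative log-likelihood per token
perplexity = np.exp(cross_entropy)   # ~9.6: as uncertain as choosing among ~10 tokens

print(f"cross-entropy={cross_entropy:.3f} nats, perplexity={perplexity:.1f}")
```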
BLEU scores a generated translation by counting how many of its 1-, 2-, 3-, and 4-grams appear in a reference translation, with a brevity penalty to stop the model gaming the metric by producing very short outputs. It is precision-flavoured: of the n-grams I produced, how many were correct?
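The core ingredient is "modified" (clipped) n-gram precision: each candidate n-gram is credited at most as many times as it occurs in the reference. The sketch below computes only that ingredient for one sentence pair; it is not the full corpus-level BLEU, which combines 1- to 4-gram precisions geometrically and applies the brevity penalty, and in practice you would use an established implementation such as sacrebleu.

```python
from collections import Counter

def clipped_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference,
    crediting each n-gram at most as often as the reference contains it."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams  = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return clipped / max(sum(cand_ngrams.values()), 1)

candidate = "the cat sat on the mat"
reference = "there is a cat on the mat"
print(clipped_ngram_precision(candidate, reference, 1))   # unigram precision: ~0.667
print(clipped_ngram_precision(candidate, reference, 2))   # bigram precision: 0.4
```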
ROUGE is the recall-flavoured cousin: of the n-grams in the reference, how many did I produce? It dominates summarisation evaluation, where coverage of the source matters more than terseness.
chrF computes F-score on character n-grams rather than word n-grams. It is more robust for morphologically rich languages such as Finnish or Turkish, where word-level metrics break down because a single English word maps to many surface forms.
BERTScore moves beyond surface n-grams entirely, embedding both candidate and reference with a pretrained transformer and computing a similarity score between the embeddings. Two paraphrases with no shared words can still score highly, which is the whole point.
Exact match is the brutal yardstick used in question-answering benchmarks: the predicted string must equal the reference string, character for character. It is harsh, but it is unambiguous.
Ranking metrics
For search, recommendation, document retrieval, and retrieval-augmented generation (RAG), the model produces an ordered list, not a yes/no decision, and what matters is whether the right items appear near the top.
Normalised discounted cumulative gain (NDCG) sums the relevance of each retrieved item, but discounts items further down the list by a factor of $1/\log_2(\text{rank}+1)$. An item at position 1 contributes its full relevance; the same item at position 20 contributes only a fraction of it. The whole sum is normalised by the score of the perfect ordering, so NDCG always lies between 0 and 1.
Mean reciprocal rank (MRR) is even simpler. For each query, find the position of the first correct answer, take its reciprocal, and average across queries. A correct first answer scores 1; a correct second scores 0.5; nothing relevant in the top ten scores nearly nothing. MRR is the right metric when only the first hit really matters.
Mean average precision (MAP) averages the precision computed at every rank where a relevant item appears, then averages across queries. It rewards both retrieving relevant items and putting them near the top.
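A compact sketch of NDCG and MRR, assuming NumPy; the relevance grades and first-hit ranks below are invented.

```python
import numpy as np

def dcg(relevances):
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum(np.asarray(relevances) / np.log2(ranks + 1))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)   # the perfect ordering
    return dcg(relevances) / dcg(ideal)

def mrr(first_hit_ranks):
    # rank of the first relevant result per query; None means nothing relevant found
    return np.mean([0.0 if r is None else 1.0 / r for r in first_hit_ranks])

# Hypothetical relevance grades for the ranked list returned for one query.
rels = [3, 2, 0, 1, 0]
print(f"NDCG = {ndcg(rels):.3f}")          # 1.0 only for the ideal ordering
print(f"MRR  = {mrr([1, 2, None]):.3f}")   # first hits at ranks 1 and 2, one miss
```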
Calibration metrics
For probabilistic models, it is not enough to know whether the prediction was right. We also want the probabilities to be calibrated. A weather forecaster who says "70 percent chance of rain" should be right roughly seven times in ten when she says it.
Expected calibration error (ECE) measures this directly. Bin the predictions by their stated confidence, say, into ten bins of width 0.1, and within each bin compare the average predicted probability to the actual fraction of positives. ECE is the weighted average of these gaps. Smaller is better. ECE is pragmatic but bin-dependent, which is its main weakness.
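A sketch of ECE with ten equal-width bins, assuming NumPy; the confidences and labels below are invented.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if lo == 0.0:                      # put exact zeros in the first bin
            in_bin |= probs == 0.0
        if not in_bin.any():
            continue
        confidence = probs[in_bin].mean()  # average stated probability in the bin
        accuracy = labels[in_bin].mean()   # actual fraction of positives in the bin
        ece += in_bin.mean() * abs(confidence - accuracy)   # weight by bin size
    return ece

probs  = [0.9, 0.8, 0.7, 0.3, 0.2, 0.95, 0.6, 0.4]   # hypothetical confidences
labels = [1,   1,   0,   0,   1,   1,    1,   0]
print(f"ECE = {expected_calibration_error(probs, labels):.3f}")
```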
The Brier score is the mean squared difference between predicted probability and the binary label: $\frac{1}{n}\sum(\hat p_i - y_i)^2$. It is a strictly proper scoring rule, meaning the only way to optimise it in expectation is to report your true probability beliefs.
The negative log-likelihood, $-\frac{1}{n}\sum [y_i \log \hat p_i + (1-y_i)\log(1-\hat p_i)]$, is also strictly proper, and is the same quantity you optimise during training as cross-entropy loss. It rewards both accuracy and calibration. Modern deep networks tend to be overconfident, predicting probabilities very close to 0 or 1, and a simple post-hoc fix called temperature scaling divides the model's logits by a learned constant before softmax to bring the probabilities back into line.
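Fitting the temperature is itself a one-parameter optimisation on held-out validation data. A minimal binary-classification sketch, assuming NumPy and SciPy are available; the logits and labels are invented.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(temperature, logits, labels):
    """Binary negative log-likelihood after dividing the logits by the temperature."""
    p = 1.0 / (1.0 + np.exp(-logits / temperature))   # temperature-scaled sigmoid
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Hypothetical validation logits from an overconfident model, with true labels.
logits = np.array([4.0, 3.5, -5.0, 6.0, -2.0, 5.5, -4.5, 1.0])
labels = np.array([1,   0,    0,   1,    1,   1,    0,   0])

result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded",
                         args=(logits, labels))
print(f"learned temperature T = {result.x:.2f}")   # T > 1 softens overconfident probabilities
```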
Choosing a metric
A few rules of thumb keep beginners out of the worst trouble.
When the classes are imbalanced, do not report accuracy alone. Prefer F1, ROC-AUC, or PR-AUC; they expose the rare-class failures that accuracy hides. When your regression target has outliers that should not dominate the score, prefer MAE or Huber loss to MSE. When the cost of different error types is unequal (a missed cancer is far worse than a false alarm), encode that in a weighted metric or report the expected cost in pounds, not abstract scores. And whatever metric you choose, always report it on held-out test data with confidence intervals; a single number with no uncertainty is a number that cannot be questioned, and any number that cannot be questioned cannot be trusted.
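Those confidence intervals are usually computed with the bootstrap: resample the test set with replacement, recompute the metric on each resample, and read off percentiles of the resulting distribution. A minimal sketch, assuming NumPy and any metric that takes (y_true, y_pred) arrays; the labels and predictions below are invented.

```python
import numpy as np

def bootstrap_ci(metric_fn, y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any per-example metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # sample with replacement
        scores.append(metric_fn(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical test-set labels and predictions, scored with plain accuracy.
accuracy = lambda yt, yp: np.mean(yt == yp)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 1])
print(bootstrap_ci(accuracy, y_true, y_pred))   # wide interval on so few examples
```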
What you should take away
- Accuracy is rarely the right metric. It is misleading whenever classes are imbalanced. Default instead to precision, recall, F1, or AUC, and pick by what kind of error matters to your application.
- Different tasks demand different metrics. Regression error for continuous outputs, F1 for imbalanced classification, AUC for ranking, BLEU/ROUGE for text, NDCG for search, perplexity for language modelling, ECE for calibration. There is no universal score.
- Precision and recall trade off against each other. You can usually buy more of one by giving up some of the other; F1 forces you to balance both.
- AUC measures ranking, not threshold choice. It is the right metric when the user of the model will set their own threshold; it can flatter weak models on heavily imbalanced data, where AUC-PR is more informative.
- Always report uncertainty. A test-set score on its own can be lucky or unlucky. Confidence intervals, typically computed via bootstrap resampling, distinguish a genuine improvement from sampling noise, and protect you from the most common form of self-deception in applied machine learning.