7.9 Model evaluation

You need different metrics for regression and classification, and within each, several depending on the cost structure of your problem. Get this wrong and you optimise the wrong thing.

7.9.1 Regression metrics

For predictions $\hat y_i$ and targets $y_i$ on $n$ test points:

  • Mean squared error: $\text{MSE} = \frac{1}{n}\sum (y_i - \hat y_i)^2$. Punishes large errors quadratically. Differentiable, and minimising it is maximum-likelihood estimation under Gaussian noise. Sensitive to outliers.
  • Root mean squared error: $\text{RMSE} = \sqrt{\text{MSE}}$. Same units as $y$.
  • Mean absolute error: $\text{MAE} = \frac{1}{n}\sum |y_i - \hat y_i|$. Minimising it gives the MLE under Laplace noise. Robust to outliers; non-differentiable at zero error.
  • Coefficient of determination: $R^2 = 1 - \frac{\sum (y_i-\hat y_i)^2}{\sum (y_i - \bar y)^2}$. Fraction of variance explained. $R^2=1$ perfect, $R^2=0$ predicts the mean, $R^2<0$ worse than the mean.
  • Mean absolute percentage error: $\text{MAPE} = \frac{100}{n}\sum |y_i-\hat y_i|/|y_i|$. Scale-free; undefined at $y_i=0$, and asymmetric: for non-negative predictions, under-prediction error is capped at 100 % while over-prediction error is unbounded.
  • Quantile / pinball loss: $L_\tau(y, \hat y) = \max\{\tau(y-\hat y),\,(\tau-1)(y-\hat y)\}$. Penalises under-prediction by $\tau$ and over-prediction by $1-\tau$; used to fit quantiles and prediction intervals in forecasting.

When to use which. RMSE for symmetric Gaussian-ish errors and quadratic cost. MAE when outliers are real and not to be over-weighted. $R^2$ to compare across datasets; MAPE for scale-free comparisons (with the caveat about zero targets).
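
To make the definitions concrete, here is a minimal NumPy sketch of the metrics above. The function name `regression_metrics` and the toy data are ours, purely for illustration:

```python
import numpy as np

def regression_metrics(y, y_hat, tau=0.9):
    """Illustrative sketch of the regression metrics of Section 7.9.1."""
    err = y - y_hat
    mse  = np.mean(err ** 2)                       # mean squared error
    rmse = np.sqrt(mse)                            # same units as y
    mae  = np.mean(np.abs(err))                    # robust to outliers
    r2   = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    mape = 100 * np.mean(np.abs(err) / np.abs(y))  # undefined if any y_i == 0
    # Pinball loss at quantile tau: under-prediction weighted by tau,
    # over-prediction by (1 - tau).
    pinball = np.mean(np.maximum(tau * err, (tau - 1) * err))
    return dict(mse=mse, rmse=rmse, mae=mae, r2=r2, mape=mape, pinball=pinball)

rng = np.random.default_rng(0)
y = rng.normal(10, 2, size=200)
y_hat = y + rng.normal(0, 1, size=200)   # a noisy but unbiased predictor
print(regression_metrics(y, y_hat))
```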

7.9.2 Classification metrics

Start with the confusion matrix for a binary classifier:

             Pred 0   Pred 1
Actual 0       TN       FP
Actual 1       FN       TP

From these four cells (computed in the sketch after the worked example below):

  • Accuracy $= (TP+TN)/(TP+TN+FP+FN)$. Misleading on imbalanced data.
  • Precision $= TP/(TP+FP)$. Of those predicted positive, what fraction were positive?
  • Recall (sensitivity, true-positive rate) $= TP/(TP+FN)$. Of all actual positives, what fraction did we catch?
  • Specificity (true-negative rate) $= TN/(TN+FP)$.
  • F1 score $= 2\cdot\frac{P\cdot R}{P+R}$. Harmonic mean of precision and recall.
  • Matthews correlation coefficient (MCC) $= (TP\cdot TN - FP\cdot FN)/\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$. A correlation in $[-1, +1]$ that is meaningful even when classes are imbalanced.

Worked example: imbalanced data. A rare-disease screen has prevalence 1 in 1000. A "predict negative always" classifier gets 99.9 % accuracy and 0 recall. Accuracy is useless here; F1 (or MCC, or AUPRC) is informative.
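
A minimal sketch of these metrics applied to the rare-disease numbers; `classification_metrics` is our own illustrative helper, returning NaN wherever a denominator vanishes:

```python
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    """Confusion-matrix metrics of Section 7.9.2 (illustrative sketch).
    Returns NaN where a metric is undefined (zero denominator)."""
    def safe(num, den):
        return num / den if den > 0 else float("nan")
    accuracy    = safe(tp + tn, tp + tn + fp + fn)
    precision   = safe(tp, tp + fp)
    recall      = safe(tp, tp + fn)        # sensitivity / TPR
    specificity = safe(tn, tn + fp)        # TNR
    f1  = safe(2 * precision * recall, precision + recall)
    mcc = safe(tp * tn - fp * fn,
               np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))))
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, f1=f1, mcc=mcc)

# "Predict negative always" on 1,000,000 screens at 1-in-1000 prevalence:
print(classification_metrics(tp=0, tn=999_000, fp=0, fn=1_000))
# accuracy = 0.999, recall = 0, precision/F1/MCC undefined (NaN)
```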

ROC and AUC. Vary the decision threshold from 1 down to 0 and plot the true-positive rate against the false-positive rate. The area under this curve, AUC-ROC, is the probability that a randomly chosen positive scores higher than a randomly chosen negative. AUC = 0.5 is random; AUC = 1 is perfect; AUC = 0 is perfectly inverted. AUC is threshold-free but can mislead under extreme imbalance: a classifier with AUC = 0.95 might still have abysmal precision if positives are 1 in $10^6$. In that case, plot precision–recall curves and report the AUPRC.
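
The probabilistic interpretation gives a direct (if quadratic-time) way to compute AUC without tracing the curve; a sketch, with `auc_roc` our own name:

```python
import numpy as np

def auc_roc(scores, labels):
    """AUC-ROC via its probabilistic interpretation: the probability that
    a uniformly random positive outscores a uniformly random negative,
    with ties counted 1/2 (the Mann-Whitney U statistic, normalised)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    # All pairwise comparisons; fine for a sketch, O(n_pos * n_neg) memory.
    greater = (pos[:, None] > neg[None, :]).mean()
    ties    = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(auc_roc(scores, labels))   # 8/9 ~ 0.889: most positives outrank negatives
```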

Calibration. Among predictions with $\hat p\approx q$, what fraction are actually positive? A perfectly calibrated classifier has an empirical positive rate of $\approx q$ for every $q$. Logistic regression and well-tuned random forests tend to be well calibrated; boosted trees and SVMs typically are not. Reliability diagrams plot empirical against predicted probabilities; the expected calibration error (ECE) summarises the discrepancy in one number. Re-calibrate via Platt scaling (fitting a sigmoid) or isotonic regression.
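
A sketch of one common ECE definition (equal-width bins, weighted by bin occupancy; binning conventions vary in the literature):

```python
import numpy as np

def expected_calibration_error(p_hat, y, n_bins=10):
    """ECE sketch: bin predictions, compare each bin's mean predicted
    probability with its empirical positive rate, weight by bin size."""
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p_hat >= lo) & (p_hat < hi) if hi < 1.0 else (p_hat >= lo)
        if in_bin.any():
            gap = abs(p_hat[in_bin].mean() - y[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of points in bin
    return ece

rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 5000)
y = rng.uniform(0, 1, 5000) < p   # labels drawn from p: calibrated by construction
print(expected_calibration_error(p, y))   # close to 0
```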

Cross-entropy and the Brier score evaluate probabilistic predictions directly, rewarding discrimination and calibration jointly: $\text{logloss} = -\frac{1}{n}\sum\big[y_i\log\hat p_i + (1-y_i)\log(1-\hat p_i)\big]$ and $\text{Brier} = \frac{1}{n}\sum(\hat p_i - y_i)^2$; lower is better for both.
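
Both fit in a few lines (a sketch; scikit-learn ships `log_loss` and `brier_score_loss` if you prefer library versions):

```python
import numpy as np

def log_loss(p_hat, y, eps=1e-12):
    """Cross-entropy of predicted probabilities; eps guards against log(0)."""
    p = np.clip(np.asarray(p_hat, float), eps, 1 - eps)
    y = np.asarray(y, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def brier(p_hat, y):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return np.mean((np.asarray(p_hat, float) - np.asarray(y, float)) ** 2)

y     = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.8, 0.1])
print(log_loss(p_hat, y), brier(p_hat, y))
```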

7.9.3 Cross-validation

Holding out 20 % of a 1000-row dataset is wasteful. $k$-fold cross-validation splits the data into $k$ disjoint folds; train on $k-1$ folds and validate on the held-out fold, repeating $k$ times so every fold serves as validation once. The mean validation error estimates generalisation. Stratified $k$-fold preserves class proportions, essential under imbalance. Leave-one-out is the limit $k=n$; time-series CV uses growing-window splits to avoid leakage from the future to the past. Nested CV is the gold standard for hyperparameter tuning: an outer loop estimates generalisation, an inner loop selects hyperparameters.
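
A minimal $k$-fold sketch. The `fit`/`error` interfaces here are our own invention for illustration; in practice scikit-learn's `KFold` and `cross_val_score` handle this:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds after shuffling (sketch)."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_val_error(X, y, fit, error, k=5):
    """Mean validation error over k folds. `fit(X, y)` must return a
    predict(X) callable; `error(y, y_hat)` is any metric from above."""
    folds = kfold_indices(len(y), k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(error(y[val], model(X[val])))
    return float(np.mean(errs))

# Toy usage: a least-squares line fit scored by RMSE (assumed interfaces).
fit = lambda X, y: (lambda Xv: np.polyval(np.polyfit(X, y, 1), Xv))
rmse = lambda y, y_hat: np.sqrt(np.mean((y - y_hat) ** 2))
X = np.linspace(0, 1, 100)
y = 2 * X + 1 + np.random.default_rng(2).normal(0, 0.1, 100)
print(cross_val_error(X, y, fit, rmse, k=5))
```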
