6.8 Model selection

We have a hypothesis class $\mathcal{H}$ that is partly fixed (the architecture family) and partly parameterised by hyperparameters: regularisation strength, network depth, learning rate, kernel bandwidth, number of trees, $k$ in $k$-NN. Model selection is the procedure for choosing the hyperparameters using the data.

K-fold cross-validation

The standard tool. Split the training data into $k$ equal folds. For each fold:

  1. Hold it out as the validation set.
  2. Train on the remaining $k-1$ folds.
  3. Record validation performance.

The cross-validation estimate is the average of the $k$ validation scores. Common choices are $k = 5$ or $k = 10$. The extreme $k = n$ is leave-one-out cross-validation (LOOCV). LOOCV is nearly unbiased but has high variance (the $n$ trained models are highly correlated, since any two of them share $n-2$ training points), and it is expensive: $n$ retrains. For least-squares regression there is a closed-form formula (the PRESS statistic) that avoids the retraining cost, as sketched below.
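The following minimal numpy sketch (synthetic data; the variable names are illustrative) computes the PRESS statistic from the hat matrix $H = X(X^\top X)^{-1}X^\top$ and checks it against brute-force LOOCV, which needs $n$ separate fits.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.5, size=n)

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages h_ii
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
press = np.sum((resid / (1 - np.diag(H))) ** 2)   # closed-form sum of squared LOOCV errors

# Brute-force LOOCV: refit least squares with each point held out
loo = 0.0
for i in range(n):
    beta = np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i), rcond=None)[0]
    loo += (y[i] - X[i] @ beta) ** 2

print(press, loo)   # agree up to floating-point error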

Specialised variants

Standard random folding fails for several common data structures.

  • Stratified $k$-fold. Preserves the class proportions in each fold. Essential for imbalanced classification (otherwise some folds may have no positive examples).
  • Group $k$-fold. Keeps all observations from the same group (patient, household, factory) in the same fold. Otherwise, having two readings from the same patient in train and validation produces leakage and grossly optimistic estimates.
  • Time-series CV. Train on data up to time $t$, validate on $t+1, \dots, t+w$. Slide the window forward. Equivalent to walk-forward validation in finance. Never train on the future; never validate on the past. (The first three variants are sketched in code after this list.)
  • Nested CV. Two loops. The inner loop selects hyperparameters on each outer fold; the outer loop estimates generalisation. Costly ($k_\text{outer} \times k_\text{inner}$ retrains per candidate configuration) but the only way to honestly report performance after a hyperparameter search.
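A short scikit-learn sketch of the first three splitters (toy arrays; the labels and group ids are illustrative) makes the guarantees concrete: stratification keeps positives in every fold, group folding keeps a group on one side of the split, and the time-series splitter never validates on the past.

import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])        # imbalanced labels
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])   # e.g. patient ids

# Stratified k-fold: every validation fold contains some positive examples
for tr, va in StratifiedKFold(n_splits=4, shuffle=True, random_state=0).split(X, y):
    assert y[va].sum() > 0

# Group k-fold: no group ever appears in both train and validation
for tr, va in GroupKFold(n_splits=3).split(X, groups=groups):
    assert set(groups[tr]).isdisjoint(groups[va])

# Time-series split: validation indices always come after training indices
for tr, va in TimeSeriesSplit(n_splits=4).split(X):
    assert tr.max() < va.min()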

Hyperparameter search

Three flavours.

Grid search. Try every combination on a discretised grid. Embarrassingly parallel and trivial to implement, but the cost grows exponentially with the number of hyperparameters: a 5-parameter grid with 5 values per parameter already requires $5^5 = 3{,}125$ evaluations.

Random search (Bergstra & Bengio, 2012). Sample combinations uniformly at random from the search space. Counter-intuitively, this beats grid search when only a few hyperparameters matter, because grid search wastes its budget exploring fine variations along irrelevant axes.
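A minimal scikit-learn sketch of random search, assuming the same SVM pipeline and breast-cancer dataset used in the worked example below; the log-uniform distributions and the n_iter budget are illustrative choices.

from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Sample from continuous log-uniform distributions instead of a fixed grid
param_dist = {
    "svc__C": loguniform(1e-2, 1e3),
    "svc__gamma": loguniform(1e-3, 1e1),
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=30, cv=5,
                            scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)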

Bayesian optimisation. Maintain a probabilistic model (typically a Gaussian process) of the validation-loss surface as a function of the hyperparameters; at each step pick the most promising next point by maximising an acquisition function such as expected improvement. The Spearmint system (Snoek et al., 2012) showed dramatic gains over random search on deep-network tuning, although for very high-dimensional spaces (hundreds of hyperparameters) tree-based surrogates such as TPE (Hyperopt) and SMAC scale better.

Modern systems combine Bayesian optimisation with early-stopping bandits (Hyperband, BOHB) that allocate compute to the most promising configurations rather than running every candidate to completion. For deep learning practitioners, libraries like Optuna and Ray Tune wrap these in a Pythonic API.
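As a flavour of what such a library looks like in practice, here is a minimal Optuna sketch (Optuna's default sampler is TPE); the search ranges and trial budget are illustrative, and the dataset is again the breast-cancer data used below.

import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Each trial proposes one hyperparameter combination from the search space
    C = trial.suggest_float("C", 1e-2, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-3, 1e1, log=True)
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)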

Worked example: scikit-learn GridSearchCV with nested CV

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm",    SVC(kernel="rbf"))
])

param_grid = {
    "svm__C":     np.logspace(-2, 3, 6),
    "svm__gamma": np.logspace(-3, 1, 5),
}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

gs = GridSearchCV(pipe, param_grid, cv=inner, scoring="roc_auc")

# Nested CV: outer loop estimates generalisation;
# the inner loop (inside gs.fit) picks hyperparameters
nested_scores = cross_val_score(gs, X, y, cv=outer, scoring="roc_auc")

print(f"Nested CV AUC: {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")

The crucial detail: StandardScaler lives inside the pipeline, which means it is refit on the training folds each time. Putting the scaling outside the pipeline would leak the held-out fold's means and standard deviations into training, a classic mistake we return to in §6.10.
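To see the contrast in code, reusing X, y, pipe, and outer from the listing above (a sketch only; the size of the gap between the two estimates depends on the dataset):

# Leaky: the scaler is fitted on ALL rows, so each held-out fold's
# statistics influence the transform applied to its training data
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(kernel="rbf"), X_leaky, y, cv=outer, scoring="roc_auc")

# Correct: the pipeline refits the scaler on each training fold only
clean_scores = cross_val_score(pipe, X, y, cv=outer, scoring="roc_auc")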
