10.15 Hyperparameter optimisation
Model performance depends on many hyperparameters: learning rate, batch size, weight decay, dropout, warmup steps, optimiser choice, architectural sizes, augmentation policy. Unlike weights, these are set before training and held fixed. The search space is high-dimensional, the objective is noisy (a single hyperparameter setting maps to a distribution over training runs), and each evaluation is expensive.
Grid search
Discretise each hyperparameter and try every combination. With $H$ hyperparameters and $g$ values each, the budget is $g^H$. Quickly intractable for $H > 3$. Wastes resources on unimportant axes: a hyperparameter that doesn't matter still multiplies the trial count by $g$.
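A minimal sketch of the idea, assuming a hypothetical `train_and_evaluate` that stands in for a full training run and returns a synthetic validation loss so the example runs end-to-end:

```python
import itertools

def train_and_evaluate(lr, weight_decay):
    # Hypothetical stand-in for a full training run; returns a synthetic validation loss.
    return (lr - 3e-4) ** 2 + (weight_decay - 1e-2) ** 2

learning_rates = [1e-4, 3e-4, 1e-3]   # g = 3 values per axis
weight_decays = [1e-3, 1e-2, 1e-1]

# Budget is the product of the grid sizes: 3 * 3 = 9 trials here.
results = {
    (lr, wd): train_and_evaluate(lr, wd)
    for lr, wd in itertools.product(learning_rates, weight_decays)
}
best_lr, best_wd = min(results, key=results.get)
```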
Use grid search only for the final 1–2 most important hyperparameters after a coarse search has narrowed things down.
Random search
Bergstra and Bengio (2012) showed that random sampling of the hyperparameter space is consistently more efficient than grid search. The argument is simple: if only $h$ of the $H$ hyperparameters matter, a grid of $g^H$ trials evaluates only $g^h$ distinct important configurations, repeating each one $g^{H-h}$ times; random sampling gives every trial a distinct value along every axis for the same budget.
A standard recipe: sample each hyperparameter from a sensible prior (log-uniform for learning rate, uniform for dropout, etc.) and run 50–100 trials. The best 5–10 are then refined with a denser local search.
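A sketch of that recipe, with log-uniform priors for scale-like hyperparameters and a uniform prior for dropout; `train_and_evaluate` is again a hypothetical stand-in for a real training run:

```python
import math
import random

rng = random.Random(0)

def sample_config():
    # Log-uniform priors for learning rate and weight decay, uniform for dropout.
    return {
        "lr": 10 ** rng.uniform(-5, -2),
        "weight_decay": 10 ** rng.uniform(-4, -1),
        "dropout": rng.uniform(0.0, 0.5),
    }

def train_and_evaluate(config):
    # Hypothetical stand-in for a full training run; returns a synthetic validation loss.
    return abs(math.log10(config["lr"]) + 3.5) + config["dropout"]

results = []
for _ in range(100):
    cfg = sample_config()
    results.append((train_and_evaluate(cfg), cfg))
results.sort(key=lambda pair: pair[0])
top_10 = results[:10]   # refine these with a denser local search
```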
Bayesian optimisation
Build a probabilistic surrogate model of validation performance as a function of hyperparameters. After each trial, update the surrogate. Use an acquisition function to choose the next trial. Common surrogates: Gaussian processes (smooth, low-dimensional) or tree-structured Parzen estimators (TPE, used in Optuna). Common acquisition functions: expected improvement, upper confidence bound.
Bayesian optimisation excels when each trial is expensive: the overhead of fitting the surrogate is dwarfed by the cost saved on bad configurations. Libraries: Optuna, Ax (BoTorch), Hyperopt.
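A minimal Optuna example using its TPE sampler; the objective here wraps a hypothetical `train_and_evaluate` standing in for a real training run:

```python
import optuna

def train_and_evaluate(lr, weight_decay, dropout):
    # Hypothetical stand-in for a real training run; returns a synthetic validation loss.
    return (lr - 3e-4) ** 2 + weight_decay + dropout

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    return train_and_evaluate(lr, weight_decay, dropout)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```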
Successive halving and Hyperband
Multi-fidelity: instead of running each trial to completion, run $n$ trials for a fraction $r$ of the budget, drop the worst half, run the survivors for $2r$, drop half again, and so on. Successive halving (Jamieson and Talwalkar 2016) does this once, for a single choice of $(n, r)$; Hyperband (Li et al. 2018) sweeps over different starting $(n, r)$ pairs to balance exploring many configurations against evaluating a few thoroughly.
For deep learning, where training curves carry useful information after a few epochs, successive halving can find good configurations at a fraction of the cost of running all trials to convergence.
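A sketch of a single successive-halving bracket, assuming a hypothetical `train_for(config, epochs)` that trains a configuration for the given budget and returns validation loss (resuming from earlier partial training is elided for brevity):

```python
import random

rng = random.Random(0)

def train_for(config, epochs):
    # Hypothetical stand-in: a synthetic validation loss that improves with budget.
    return abs(config["lr"] - 3e-4) / epochs

def successive_halving(n=64, r=1, max_epochs=64):
    configs = [{"lr": 10 ** rng.uniform(-5, -2)} for _ in range(n)]
    budget = r
    while len(configs) > 1 and budget <= max_epochs:
        scored = sorted(configs, key=lambda c: train_for(c, budget))
        configs = scored[: max(1, len(scored) // 2)]   # keep the better half
        budget *= 2                                     # double the per-trial budget
    return configs[0]

best = successive_halving()
```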
BOHB
BOHB (Falkner et al. 2018) combines Bayesian optimisation with Hyperband. Hyperband allocates budgets and decides which configurations to advance; a TPE-style Bayesian model proposes which new configurations to start. The result is sample-efficient (from the Bayesian model) and budget-efficient (from Hyperband's early stopping).
Population-based training
Jaderberg et al. (2017): train many models in parallel. Periodically, the worst performers copy the weights and hyperparameters of the best, then perturb the copied hyperparameters slightly. The best hyperparameters are discovered during training rather than via a separate search. PBT also discovers schedules: the best hyperparameter values late in training differ from the best early on, and because the population keeps adapting, PBT captures this naturally.
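A sketch of the exploit-and-explore step, assuming each worker is represented as a dict holding its weights, hyperparameters, and latest validation score; the 25% cutoff and the 0.8/1.2 perturbation factors are illustrative choices, not the paper's exact settings:

```python
import copy
import random

rng = random.Random(0)

def exploit_and_explore(population):
    """One PBT step. Each worker is a dict with "weights", "hparams", "score"."""
    ranked = sorted(population, key=lambda w: w["score"], reverse=True)
    cutoff = max(1, len(ranked) // 4)
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for worker in bottom:
        donor = rng.choice(top)
        # Exploit: copy a better worker's weights and hyperparameters.
        worker["weights"] = copy.deepcopy(donor["weights"])
        # Explore: perturb the copied hyperparameters.
        worker["hparams"] = {k: v * rng.choice([0.8, 1.2])
                             for k, v in donor["hparams"].items()}
    return population
```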
Practical advice
- Start with sensible optimiser defaults: $\eta = 3 \times 10^{-4}$ for AdamW with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (LLM pretraining typically uses $\beta_2 = 0.95$ to mitigate loss spikes; the recipe in §10.17 reflects this), batch size 256, weight decay $10^{-2}$, warmup over the first 1% of steps; see the sketch after this list.
- Run a coarse random search over the hyperparameters most likely to matter (learning rate, weight decay, dropout, possibly model size).
- Refine with Bayesian optimisation or successive halving on the survivors.
- Document the search space, number of trials and selection criterion so results are reproducible.
- Beware of double-dipping: if you use the validation set to select hyperparameters, evaluate the final model on a held-out test set, not the validation set.
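A sketch of the default starting point in PyTorch; the model, total step count, and batch size are placeholders, and the warmup schedule is one reasonable choice (linear warmup to the base rate, then constant):

```python
import torch

model = torch.nn.Linear(512, 512)      # placeholder model
batch_size = 256                        # default batch size from the list above
total_steps = 10_000                    # placeholder training length
warmup_steps = total_steps // 100       # warmup over the first 1% of steps

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.999),   # LLM pretraining often uses beta2 = 0.95
                              weight_decay=1e-2)

# Linear warmup from 0 to the base learning rate, then hold constant.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
```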