Chapter Ten

Training & Optimisation

Learning Objectives
  1. Explain stochastic gradient descent and its mini-batch variant, including the role of batch size
  2. Compare adaptive optimisers (Adam, RMSProp, AdaGrad) to vanilla SGD and describe when each is preferred
  3. Apply batch normalisation and related normalisation layers to stabilise deep network training
  4. Use regularisation techniques (dropout, weight decay, data augmentation) to reduce overfitting
  5. Design learning rate schedules (warmup, cosine decay, step decay) and tune hyperparameters systematically

You have built a neural network with millions of parameters. Now you need those parameters to take on values that make good predictions — not just on the training data, but on data the model has never seen. That is the training problem.

It sounds simple: compute the gradient, step downhill, repeat. In practice, it is one of the hardest parts of deep learning. The loss landscape is full of saddle points, narrow valleys, and flat regions. Your gradient estimates are noisy because you compute them on small batches, not the full dataset. And the choices you make — which optimiser, what learning rate, how much regularisation — can mean the difference between a model that works and one that fails to learn at all.

This chapter covers the practical toolkit for training neural networks. You will learn SGD and its variants, adaptive optimisers like Adam, batch normalisation, regularisation, learning rate schedules, and hyperparameter tuning.

10.1   Stochastic Gradient Descent

Batch Gradient Descent

In the simplest version, you compute the gradient of the loss over the entire training set and update: w ← w − η ∇w L, where η is the learning rate. This always follows the true steepest descent direction. But it is too slow for modern datasets. A single update requires a forward and backward pass through every example.

Mini-Batch SGD

SGD (Robbins and Monro, 1951) estimates the gradient from a small random subset — a mini-batch of size B:

∇w L ≈ (1/B) Σi=1^B ∇w ℓ(xi, yi; w)

This estimate is unbiased (its expected value equals the true gradient) but noisy. The noise decreases as the batch grows. In practice, batch sizes of 32 to 512 balance noise against GPU throughput.
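A quick numerical check of this claim (a sketch in NumPy with a toy least-squares loss; the data and helper names are illustrative): averaging many mini-batch gradients should recover the full-batch gradient, since each estimate is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression problem: l(x, y; w) = 0.5 * (w·x − y)^2
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w

def grad(w, X, y):
    """Gradient of the mean squared-error loss over the given examples."""
    residual = X @ w - y
    return X.T @ residual / len(y)

w = np.zeros(5)
full_grad = grad(w, X, y)                 # exact gradient over all 1000 examples

B = 32                                    # mini-batch size
estimates = []
for _ in range(500):
    idx = rng.choice(1000, size=B, replace=False)
    estimates.append(grad(w, X[idx], y[idx]))
avg_estimate = np.mean(estimates, axis=0)
# each estimate is noisy, but their average lands close to full_grad
```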

Why Noise Helps

The noise in SGD is not just a cost — it is a feature. It acts as implicit regularisation, steering the optimiser away from sharp, narrow minima that generalise poorly. SGD with moderate noise preferentially finds flat minima — regions where the loss does not change much if you perturb the weights. Flat minima generalise better because the model's predictions stay stable when the data distribution shifts slightly between training and test time.

Momentum

Plain SGD can oscillate in narrow valleys. Momentum fixes this by accumulating a moving average of past gradients:

  • v ← μv − η ∇w L
  • w ← w + v

The momentum coefficient μ is typically 0.9 or 0.99. Think of a ball rolling downhill: momentum lets it coast through small bumps and shallow local minima instead of getting stuck. This simple addition often dramatically speeds up convergence.
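The two update rules above fit in a few lines. A minimal sketch (illustrative values, minimising the quadratic f(w) = ½w², whose gradient is simply w):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One SGD-with-momentum update: v is a decaying sum of past gradients."""
    v = mu * v - lr * grad
    w = w + v
    return w, v

# The "ball rolling downhill": coast toward the minimum of f(w) = 0.5 * w^2
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)   # for this f, the gradient is w itself
# w is now very close to the minimum at 0
```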

Nesterov Momentum

Nesterov accelerated gradient (NAG) does a "lookahead" before computing the gradient. It first takes a tentative step in the momentum direction, computes the gradient at that lookahead point, and then corrects. This is especially effective late in training, when the optimiser is oscillating around a minimum and needs to slow down.

SGD Is Still First-Class

Despite fancier optimisers, SGD with momentum remains the default in many computer vision pipelines. It often finds solutions that generalise as well as or better than adaptive methods — though it may need more careful tuning.

10.2   Optimisers (Adam, RMSProp)

Vanilla SGD applies the same learning rate to every parameter. That is a problem. Parameters tied to rare features get infrequent but informative gradients. Parameters tied to common features get frequent but redundant updates. Adaptive methods give each parameter its own learning rate.

AdaGrad

AdaGrad divides each parameter's learning rate by the square root of the sum of all past squared gradients. Parameters with large past gradients get smaller updates. But the denominator grows forever, so the learning rate can shrink to near zero long before training finishes. This makes AdaGrad poorly suited to deep learning.

RMSProp

RMSProp — proposed by Geoffrey Hinton in unpublished lecture notes (see Goodfellow et al., 2016) — fixes AdaGrad's decay by replacing the running sum with an exponentially weighted moving average:

s ← β2 s + (1 − β2) g^2

Updates are divided by √s + ε, with the decay rate β2 typically 0.9 and ε ≈ 10^−8. By forgetting old gradients, RMSProp keeps its curvature estimate current and avoids learning rate collapse.
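As a sketch (`rmsprop_step` is an illustrative helper, NumPy): the moving average makes the effective step size roughly scale-free, so parameters with wildly different gradient magnitudes move at comparable speeds.

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.1, beta=0.9, eps=1e-8):
    """RMSProp: divide each step by a moving root-mean-square of recent gradients."""
    s = beta * s + (1 - beta) * grad**2
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s

w, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(3):
    w, s = rmsprop_step(w, s, grad=np.array([100.0, 0.01]))
# both parameters took (almost) identical steps despite the 10^4 gradient gap
```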

Adam

Adam (Kingma and Ba, 2014) combines momentum with adaptive learning rates. It tracks two running averages:

  • First moment m (gradient mean, like momentum)
  • Second moment v (gradient variance, like RMSProp)

Both are initialised at zero, so they are biased toward zero early on. Adam corrects for this with bias-corrected estimates m̂ = m / (1 − β1^t) and v̂ = v / (1 − β2^t). The update is:

w ← w − η m̂ / (√v̂ + ε)

Defaults: β1 = 0.9, β2 = 0.999, ε = 10^−8. Adam is popular because it converges fast and needs little tuning.
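A direct transcription of the update (a sketch; `adam_step` is an illustrative name): note how bias correction makes the very first step have magnitude ≈ η, regardless of the gradient's scale.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the step count, starting at 1."""
    m = b1 * m + (1 - b1) * grad            # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2         # second moment (RMSProp-style)
    m_hat = m / (1 - b1**t)                 # bias corrections
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = adam_step(np.zeros(1), np.zeros(1), np.zeros(1),
                    grad=np.array([1000.0]), t=1)
# first step has magnitude ≈ lr = 1e-3, even though the gradient was 1000
```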

Adam's Weaknesses

Wilson et al. (2017) showed that Adam can find solutions that generalise worse than well-tuned SGD, particularly in image classification. The adaptive rates can become too large for parameters with small second-moment estimates. This motivated AMSGrad (ensures the second moment never decreases) and AdamW (decouples weight decay from gradient scaling).

AdamW: The Modern Default

In standard Adam, L2 regularisation is scaled by the adaptive denominator, weakening it. Loshchilov and Hutter (2019) proposed applying weight decay directly to the parameters:

w ← (1 − λη) w − η m̂ / (√v̂ + ε)

This simple change improves generalisation. AdamW with cosine learning rate decay is now the standard recipe for training Transformers.
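The difference from standard Adam is a single term (a sketch; `adamw_step` is an illustrative name): the decay multiplies w directly instead of being added to the gradient and divided by the adaptive denominator.

```python
import numpy as np

def adamw_step(w, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """AdamW: weight decay applied to w directly, outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    w = (1 - lr * wd) * w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# With a zero gradient the weights still shrink by the factor (1 − ηλ)
w, m, v = adamw_step(np.ones(1), np.zeros(1), np.zeros(1), grad=np.zeros(1), t=1)
```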

Other Optimisers

  • LAMB: extends Adam with per-layer learning rate scaling. Enables stable training with very large batch sizes for models like BERT.
  • Lion: discovered through automated program search. Uses only the sign of the momentum update, giving uniform-magnitude updates and lower memory use.

No single optimiser wins everywhere, but Adam and its variants are the most common starting point.

10.3   Batch Normalisation

The distribution of each layer's inputs shifts as upstream weights change. Ioffe and Szegedy (2015) called this internal covariate shift and proposed batch normalisation (BN) as the fix.

How It Works

For each neuron, BN computes the mean μB and variance σB^2 of its activations across the current mini-batch, then normalises:

x̂ = (x − μB) / √(σB^2 + ε)

Two learnable parameters — scale γ and shift β — produce the final output: y = γx̂ + β. These let the network recover the identity transform if that is optimal.
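The forward pass fits in a few lines. A sketch for a (batch, features) array (all names illustrative; `running` holds the running statistics that the later "Inference Mode" discussion relies on):

```python
import numpy as np

def batch_norm(x, gamma, beta, running, training, momentum=0.1, eps=1e-5):
    """Batch normalisation over axis 0 of a (batch, features) array."""
    if training:
        mu, var = x.mean(axis=0), x.var(axis=0)
        # update running statistics for use at inference time
        running["mean"] = (1 - momentum) * running["mean"] + momentum * mu
        running["var"] = (1 - momentum) * running["var"] + momentum * var
    else:
        mu, var = running["mean"], running["var"]   # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
running = {"mean": np.zeros(4), "var": np.ones(4)}
out = batch_norm(x, np.ones(4), np.zeros(4), running, training=True)
# in training mode, each feature of `out` has near-zero mean and unit variance
```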

Practical Benefits

  • Less sensitive to learning rate and initialisation: normalisation constrains activation magnitudes, so you can use higher learning rates.
  • Faster training: higher learning rates mean faster convergence.
  • Mild regularisation: normalisation statistics depend on which examples happen to be in the mini-batch, adding stochastic noise.

Why It Really Works

The original "covariate shift" explanation has been largely revised. Santurkar et al. (2018) showed that BN does not meaningfully reduce covariate shift. Instead, BN works by smoothing the loss landscape — reducing the Lipschitz constants of the loss and its gradient. This makes the optimisation better conditioned and permits larger learning rates.

This is a useful reminder: effective techniques can precede correct theoretical understanding.

Inference Mode

At test time, you cannot rely on mini-batch statistics (you may have a single example). The standard solution: maintain running averages of mean and variance during training and use those fixed values at inference. Failing to switch to these running stats is a common bug.

Alternatives

BN breaks down with very small batches or non-independent samples (object detection, video, reinforcement learning). Alternatives:

  • Layer normalisation (Ba et al., 2016): normalises across all neurons within a single example. Batch-size-independent. The standard in Transformers.
  • Instance normalisation: normalises over spatial dimensions. Popular in style transfer.
  • Group normalisation: divides channels into groups, normalises within each.

BN remains the default for CNNs with large batches. Layer norm is the default for Transformers.

10.4   Regularisation in Deep Learning

Deep networks are so expressive that they can memorise training data, noise included. Regularisation constrains the model so it learns general patterns rather than fitting noise. Without it, training loss drops to near zero while test loss climbs. That is overfitting.

Weight Decay (L2)

Add a penalty proportional to the squared weights: Lreg = L + (λ/2) Σi wi^2. The gradient of the penalty is λwi, pushing each weight toward zero. This discourages the network from relying on any single large weight.

With Adam, use decoupled weight decay (AdamW) to avoid the adaptive denominator weakening the regularisation.

Dropout

Dropout (Srivastava et al., 2014) randomly sets each neuron's output to zero with probability p during training. Common values: p = 0.5 for hidden layers, 0.1–0.3 for input layers. Surviving activations are scaled by 1/(1 − p) so expected values stay the same (inverted dropout). At test time, all neurons are active.

Dropout trains an implicit ensemble of sub-networks that share weights. Each mini-batch sees a different random sub-network. The final model is an average over all of them. This is why dropout works so well in large fully connected layers.
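Inverted dropout is a mask plus a rescale (a sketch; `dropout` is an illustrative helper): the 1/(1 − p) factor keeps the expected activation unchanged, which is why no scaling is needed at test time.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale survivors."""
    if not training:
        return x                        # test time: all neurons active
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1 − p
    return x * mask / (1 - p)

acts = np.ones(10_000)
dropped = dropout(acts, p=0.5, rng=np.random.default_rng(0))
# about half the activations are zero, the rest are 2.0; the mean stays ≈ 1.0
```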

Data Augmentation

Apply random transformations to training inputs: rotations, flips, crops, colour jitter, noise. This enlarges the effective training set and forces the model to learn features invariant to these transforms.

Advanced strategies:

  • Cutout: mask random rectangular patches.
  • Mixup (Zhang et al., 2017): blend pairs of inputs and their labels.
  • CutMix: replace a patch of one image with a patch from another.

Data augmentation is essential in computer vision, especially on smaller datasets.
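Mixup, for instance, is only a couple of lines (a sketch; assumes one-hot labels and a Beta-distributed mixing weight, as in the original formulation):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels with a Beta(α, α) weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.zeros(4), np.eye(3)[0], np.ones(4), np.eye(3)[1],
                     rng=np.random.default_rng(0))
# y_mix is a soft label: its entries still sum to 1
```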

Early Stopping

Monitor validation loss during training. Stop when it starts to rise, even though training loss is still falling. The number of epochs acts as a complexity control. Early stopping is free — it needs no changes to the model — and nearly everyone uses it.

Label Smoothing

Label smoothing (Szegedy et al., 2016) replaces hard one-hot targets with softened ones: assign 1 − α to the correct class and α/(K − 1) to each other class. Typically α = 0.1. This prevents overconfidence and improves calibration. Widely used in Transformer training.
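The transformation is simple enough to verify directly (a sketch; `smooth_labels` is an illustrative name):

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    """Replace hard targets: 1 − α on the true class, α/(K − 1) elsewhere."""
    K = y_onehot.shape[-1]
    return y_onehot * (1 - alpha) + (1 - y_onehot) * (alpha / (K - 1))

smoothed = smooth_labels(np.eye(5)[0])
# [0.9, 0.025, 0.025, 0.025, 0.025] — still a valid distribution (sums to 1)
```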

10.5   Learning Rate Schedules

The learning rate is the single most important hyperparameter. Too high and the model diverges. Too low and training crawls. And the optimal rate changes during training: high early on for fast exploration, low later for precise convergence.

Step Decay

Reduce the learning rate by a fixed factor at set epochs. For example, start at 0.1 and divide by 10 at epochs 30, 60, and 90 (a standard ResNet recipe). Simple, but introduces extra hyperparameters.

Exponential Decay

Multiply by a fixed factor each epoch: ηt = η0 × γ^t. Smoother than step decay, but can reduce the rate too fast.

Cosine Annealing

One of the most popular schedules is cosine annealing (Loshchilov and Hutter, 2016):

ηt = ηmin + ½(ηmax − ηmin)(1 + cos(π t / T))

The rate drops slowly at first, faster in the middle, and slowly again at the end. This matches the observation that fine-grained tuning matters most in the final phase. Variants with warm restarts periodically reset the rate to let the optimiser escape bad basins.

Linear Warmup

During the first few hundred or thousand steps, linearly increase the learning rate from near zero to its target maximum. This is critical for Adam, whose second-moment estimates are unreliable early on. Without warmup, initial updates can be too large and destabilise training.

Linear warmup + cosine decay is the standard recipe for training large Transformers (GPT, BERT, ViT).
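That recipe is a short piecewise function (a sketch; `lr_at_step` is an illustrative name combining the warmup ramp with the cosine formula above):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from 0 to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)  # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

# e.g. 100 warmup steps, 1000 total: rate peaks at step 100, decays to 0 at 1000
```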

One-Cycle Policy

Smith's one-cycle policy (Smith, 2018) linearly increases the rate in the first portion of training, then linearly decreases it back to a very low value. The high rate in the middle acts as a regulariser. The low rate at the end allows precise convergence. This often achieves equal or better accuracy in fewer total epochs.

Interactions

Learning rate interacts with batch size, weight decay, and dropout. The linear scaling rule says: if you double the batch size, double the learning rate (with enough warmup). Understanding these interactions is key to efficient training.

10.6   Hyperparameter Tuning

Model performance depends on many hyperparameters: learning rate, batch size, weight decay, dropout, architecture, augmentation pipeline. Unlike weights, these are set before training. The search space is vast and interactions are complex.

Grid Search

Try every combination of predefined values. Scales exponentially with the number of hyperparameters. Wastes many trials on unimportant dimensions.

Random Search

Bergstra and Bengio (2012) showed that random sampling is substantially more efficient. If only a few hyperparameters strongly affect performance (which is usually the case), random search explores those dimensions more thoroughly than a grid does.
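As a sketch (`sample_config` is an illustrative helper): note the log-uniform sampling for scale-type hyperparameters such as the learning rate, so that values near 10^−4 and 10^−3 are equally likely to be tried.

```python
import numpy as np

def sample_config(rng):
    """Draw one random hyperparameter configuration."""
    return {
        "lr": 10 ** rng.uniform(-5, -2),            # log-uniform over [1e-5, 1e-2]
        "weight_decay": 10 ** rng.uniform(-5, -2),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = np.random.default_rng(0)
trials = [sample_config(rng) for _ in range(50)]
# each trial would be scored by a short training run; the best config is kept
```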

Bayesian Optimisation

Build a probabilistic model (typically a Gaussian process) of how performance varies with hyperparameters. After each trial, update the model and use an acquisition function (expected improvement, upper confidence bound) to pick the next configuration. This balances trying near the current best (exploitation) with exploring uncertain regions (exploration).

Bayesian optimisation is especially effective when each trial is expensive. Libraries: Optuna, Hyperopt, BoTorch.

Multi-Fidelity Methods

Train each candidate for a few epochs. Drop the worst performers. Allocate more resources to survivors. Successive halving and Hyperband formalise this. Combined with Bayesian optimisation (BOHB), this finds good configurations quickly while continuing to improve with more budget.
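Successive halving itself is only a loop (a sketch; `evaluate(config, budget)` stands in for a short training run and is an assumption of this example):

```python
import numpy as np

def successive_halving(configs, evaluate, budget=1, rounds=3):
    """Keep the top half of configs each round, doubling the budget for survivors."""
    survivors = list(configs)
    for _ in range(rounds):
        scores = [evaluate(c, budget) for c in survivors]
        order = np.argsort(scores)[::-1]                  # higher score = better
        survivors = [survivors[i] for i in order[: max(1, len(survivors) // 2)]]
        budget *= 2
    return survivors[0]

# Toy run: pretend the configs are learning rates and 1e-3 is ideal
best = successive_halving([1e-1, 1e-2, 1e-3, 1e-4],
                          evaluate=lambda lr, budget: -abs(lr - 1e-3))
```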

Practical Advice

Start with well-established defaults:

  • Adam learning rate: ~1 × 10^−3 or 3 × 10^−4
  • Weight decay: 10^−4 to 10^−2
  • Dropout: 0.1 to 0.5

Run a coarse random search to find the important hyperparameters and their ranges. Refine with Bayesian optimisation. Transfer learning reduces the search space further, since fine-tuning a pretrained model needs far less tuning than training from scratch.

Reproducibility

Hyperparameter tuning raises fairness and reproducibility concerns. A heavily tuned model may appear to beat a baseline evaluated with defaults, even if the architectures are comparable. Best practice: document the search space, number of trials, and selection criterion so results can be compared and reproduced.