10.1 The optimisation problem in deep learning
Training a deep network is, on paper, the simplest thing in machine learning: pick a loss, compute its gradient, take a small step downhill, repeat. The mathematics fits on a postcard. The practice fills entire books, careers, and several billion-pound research budgets. A modern frontier model has hundreds of billions of parameters, is trained on trillions of tokens, costs tens of millions of pounds in electricity, and spends weeks on tens of thousands of GPUs. Every single one of those steps is just a gradient descent update. What makes deep learning training hard is not the optimisation rule but the loss surface, the scale, the noise, the numerical precision, the memory hierarchy, and the empirical art that has accumulated over a decade of large-scale runs.
The headline tension of this chapter is that deep learning's optimisation problem is, by every classical criterion, intractable. The objective is non-convex, the dimension is in the billions, the gradient is computable only stochastically, the Hessian is too large to store, and the conditioning is appalling. Classical optimisation theory promises us very little under these conditions. And yet stochastic gradient descent, a method designed in 1951 for far simpler problems, works well, provided we equip it with the right initialisation, normalisation, learning-rate schedule, momentum and adaptive scaling. Understanding why it works, and how to make it work reliably at scale, is the subject of the remaining sections.
This chapter covers the practice of optimisation in deep learning, agnostic to the architecture. The convex theory of Chapter 3 does not apply directly, but the algorithm transfers, modified, regularised and re-tuned.
What this chapter covers
The chapter unfolds in roughly three movements. The first establishes the core algorithms. Section 10.2 returns to plain gradient descent and pins down its behaviour on the kinds of objectives we actually encounter. Section 10.3 introduces stochastic gradient descent, the workhorse of every deep learning system since AlexNet. Section 10.4 sketches the convergence theory we have for SGD on non-convex objectives: modest, but enough to understand the role of step size and noise. Section 10.5 adds momentum, the single most important modification to plain SGD; section 10.6 covers adaptive learning rates (RMSProp, Adam, AdamW), which dominate transformer training. Section 10.7 surveys newer optimisers (Lion, Sophia, Shampoo and friends) that occasionally displace Adam in particular regimes.
The second movement is about scheduling and conditioning. Section 10.8 covers learning-rate schedules: warmup, cosine, linear decay, the schedules that have become near-universal in large-model training. Section 10.9 examines batch size and the linear scaling rule, which connects compute-budget choices to optimisation behaviour. Section 10.10 handles gradient clipping and the practical management of gradient noise. Section 10.11 covers mixed-precision training and the bf16/fp16 distinction that lets billion-parameter models fit on a GPU in the first place.
The third movement is engineering at scale. Section 10.12 covers distributed training: data, tensor and pipeline parallelism, ZeRO, FSDP and the bandwidth budget. Section 10.13 takes apart the bubble overhead of pipeline and expert parallelism. Section 10.14 covers double descent and implicit regularisation, the surprising fact that overparameterisation makes optimisation easier, not harder. Sections 10.15 and 10.16 are practical: hyperparameter optimisation and how to debug a stalled training run. Section 10.17 walks through a complete from-scratch training loop you can run on a single GPU. Section 10.18 connects everything back to the architectural chapters that follow.
The training loop, in detail
Every deep learning system, from a hobbyist's MNIST notebook to GPT-4's training cluster, runs the same loop. It is worth writing out at the level of pseudo-code that a hardware engineer would recognise:
init θ                        # weights, biases, embeddings
init optimiser state          # momentum buffers, second moments
init schedule                 # learning-rate trajectory η(t)
for step t in 1..T:
    sample minibatch B_t from data (with shuffling, sharding)
    forward:  ŷ = f_θ(x_B)
    loss:     L_t = mean ℓ(ŷ, y_B)          # plus regularisers
    backward: g_t = ∇_θ L_t                 # autograd / backprop
    all-reduce g_t across devices           # if distributed
    clip ‖g_t‖ if needed                    # cap exploding norms
    update optimiser state with g_t → û_t   # momentum, RMS, etc.
    η_t = schedule(t)                       # warmup, cosine, ...
    θ ← θ - η_t · û_t                       # apply update
    if t mod V == 0:
        evaluate validation loss
        save checkpoint if best
A few things deserve emphasis even at this level. Each step of the loop is short (milliseconds on a modern GPU for a small model, perhaps a second per step at frontier scale), but training runs for $10^5$ to $10^7$ steps, so the accumulated work is enormous. The optimiser state is additional memory: Adam needs two extra tensors per parameter, doubling or tripling weight memory before activations are even considered. The all-reduce step quietly dominates wall-clock time in distributed runs; section 10.12 explains why. The validation evaluation is what tells you whether you are learning rather than just memorising; section 10.16 explains how to read its trajectory.
The loop is short, but every line hides an essay. The forward pass touches activation memory and numerical precision (chapter 9). The backward pass is reverse-mode autodiff (chapter 3) plus the architectural specifics. The all-reduce is bounded by NVLink and InfiniBand bandwidth. The clip and the schedule are justified mainly by the empirical literature. And the entire loop is wrapped in a checkpoint system that lets you resume after a node failure: at frontier scale, hardware fails roughly every few hours, and you cannot afford to start over.
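To make the pseudo-code concrete, here is a minimal single-GPU version of the same loop in PyTorch. It is a sketch: the toy data, model and hyperparameters are placeholders rather than recommendations, and the distributed all-reduce, mixed precision and checkpointing steps are deliberately omitted (they return in sections 10.11, 10.12 and 10.16).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model; every choice here is illustrative.
X = torch.randn(10_000, 32)
y = (X.sum(dim=1, keepdim=True) > 0).float()
loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
epochs = 20
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(loader))

for epoch in range(epochs):
    for xb, yb in loader:                              # sample minibatch B_t
        logits = model(xb)                             # forward
        loss = loss_fn(logits, yb)                     # loss L_t
        opt.zero_grad(set_to_none=True)
        loss.backward()                                # backward: g_t = ∇_θ L_t
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip ‖g_t‖
        opt.step()                                     # update state, apply θ ← θ − η_t·û_t
        sched.step()                                   # η_t = schedule(t)
    print(f"epoch {epoch}: last minibatch loss {loss.item():.4f}")
```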
Why training is hard
Six structural difficulties recur throughout this chapter. They are worth naming up front so the techniques in later sections feel less like a grab-bag and more like a coherent response to known obstacles.
- Non-convex. The composition of ReLU layers, residual connections, attention and normalisation produces a loss surface with countless saddle points, plateaus, and a thin scattering of local minima. Classical convex-optimisation guarantees do not apply. We do not seek the global minimum; we seek any point on the manifold of low training loss that also generalises.
- High-dimensional. With $d \in [10^9, 10^{12}]$, no human intuition survives. We cannot visualise the surface. We cannot store the Hessian. We cannot afford line searches. Every viable algorithm uses only the gradient or, at most, diagonal curvature surrogates.
- Stochastic. The full-batch gradient is unaffordable at $N = 10^{12}$ tokens. We work with mini-batch estimates whose variance scales as $\sigma^2 / B$. That noise is partly a curse (it slows late-stage convergence) and partly a blessing (it pushes us off saddles and biases us towards flat minima; see section 10.14).
- Numerical. Modern training runs in bf16 or fp16, with master weights in fp32 and occasionally activations in fp8. Each precision has its own dynamic range; loss scaling, careful normalisation, and gradient clipping exist to keep tensors inside the representable region. A single overflow can destroy a multi-week run.
- Distributed. Parameters, gradients, optimiser state and activations are partitioned across thousands of accelerators connected by NVLink, NVSwitch, InfiniBand and Ethernet, in roughly that bandwidth order. Optimisation step time is bounded not by FLOPs but by communication. ZeRO, FSDP, tensor and expert parallelism are responses to specific bandwidth bottlenecks.
- Memory-bound. For large models, activations, not weights, dominate memory. Activation checkpointing trades extra forward-pass FLOPs for less stored state. The optimiser footprint (momentum buffers, second moments) further squeezes memory; sharding the optimiser state across devices, as in ZeRO-1, is what lets a 70-billion-parameter model train on commodity nodes at all. A back-of-the-envelope accounting of these costs follows this list.
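How quickly the optimiser state crowds out everything else is easy to see with the usual mixed-precision Adam accounting (16-bit weights and gradients plus fp32 master weights, momentum and second moments, as in the ZeRO paper's analysis). The 7-billion-parameter figure below is purely illustrative:

```python
# Per-parameter bytes under mixed-precision Adam: 16-bit weight (2) and
# gradient (2), plus fp32 master weight (4), momentum (4) and variance (4).
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4           # = 16 bytes per parameter

n_params = 7e9                                 # illustrative 7B-parameter model
static_gib = n_params * BYTES_PER_PARAM / 2**30
print(f"weights + gradients + optimiser state: {static_gib:.0f} GiB")
# ≈ 104 GiB before a single activation is stored -- more than one 80 GiB
# accelerator, which is why ZeRO/FSDP shard this state across devices.
```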
A useful rule of thumb: in 2026, the practical bottleneck of any new training run is, in decreasing order of likelihood, memory, communication bandwidth, numerical instability, optimisation pathology, and only finally the choice of optimiser. The optimiser theory in this chapter is necessary but rarely the limiting factor.
It is also worth noting how these difficulties interact. Stochastic noise interacts with numerical precision: an fp16 gradient with mean near zero can be dominated by quantisation error, which is why loss scaling (section 10.11) and bf16 exist. High dimensionality interacts with non-convexity: in low dimension, saddle points trap iterates, but in high dimension, almost every critical point is a saddle, and SGD's noise reliably escapes. Distributed training interacts with batch size: scaling out by data parallelism necessarily grows the effective batch, which changes the optimisation dynamics (section 10.9). Memory pressure interacts with optimiser choice: AdamW's two extra tensors per parameter cost more than they sound like they should when activations are also competing for HBM. None of these difficulties lives in isolation, and most of the apparent complexity of modern training recipes is actually the geometry of these interactions.
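The noise-precision interaction is easy to demonstrate. fp16's smallest positive subnormal is roughly $6 \times 10^{-8}$, so a gradient component much smaller than that silently flushes to zero, which is exactly what loss scaling guards against. The snippet below only illustrates the representational issue; the full machinery (dynamic scaling, unscaling before clipping) belongs to section 10.11.

```python
import torch

g = torch.tensor(1e-8)                 # a tiny late-training gradient component
print(g.to(torch.float16))             # tensor(0., dtype=torch.float16): underflow
print(g.to(torch.bfloat16))            # ~1e-08: bf16 keeps fp32's exponent range

scale = 2.0 ** 14                      # loss scaling: scale the loss (and hence
print((g * scale).to(torch.float16))   # the gradients) back into fp16's range,
# then divide the gradients by `scale` in fp32 before the optimiser step.
```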
How modern training has changed
It is instructive to track how training practice has shifted decade by decade. The optimisation theory is largely unchanged since the 1990s; what has moved by orders of magnitude is the engineering.
Pre-2015, the dominant recipe was SGD with momentum, careful Glorot or He initialisation, no normalisation (or, in CNNs, batch normalisation), and step-decay learning rates tuned by hand. Models had millions of parameters; runs took days on a single GPU; everything fit in fp32 in a single device's memory.
Between 2015 and 2020, four shifts converged. Adam and its weight-decoupled variant AdamW became the default for almost every architecture except classical ResNets. Layer normalisation displaced batch normalisation in transformers, because batch statistics are unreliable with variable-length sequences and small per-device batches. Mixed-precision training in fp16, popularised by Nvidia's Apex and then folded into PyTorch and JAX, roughly halved the memory footprint and doubled effective compute. Cosine schedules with linear warmup became the de facto standard, replacing step decays.
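The shape that "cosine schedule with linear warmup" refers to fits in a few lines. The warmup length, peak and floor below are illustrative values, not a recipe:

```python
import math

def lr_at(step, total_steps, warmup_steps=2_000, peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr (illustrative values)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```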
From 2020 onward, the scale tipped from "lab" into "industrial". Batch sizes pushed past a million tokens. Schedules grew warmups of thousands of steps. ZeRO-1, ZeRO-2, ZeRO-3 and FSDP made it possible to train models whose parameters did not fit on any single GPU. Tensor parallelism (Megatron) and pipeline parallelism (GPipe, PipeDream, 1F1B) split the model itself across devices. Expert parallelism (Switch Transformer, GShard) added conditional computation. Bf16 displaced fp16 as the preferred 16-bit format because of its dynamic range. By 2024, fp8 training had become viable on Hopper-class hardware. DeepSeek-V3 (Dec 2024) trained 671 billion parameters end-to-end in FP8 mixed precision on H800s, the canonical proof point at frontier scale; NVFP4 / FP4 training on Blackwell is now an active research area. The optimisation algorithm at the centre of this entire stack is still SGD with momentum and adaptive scaling; the bones are unchanged. What has grown is the body.
A spectrum of training scales
The same loop runs across very different operating regimes, and the choices that are sensible at one scale are absurd at another. The table below sketches the four regimes most readers will encounter:
| Scale | Typical compute | Examples |
|---|---|---|
| Personal | 1 GPU, hours | Fine-tuning small models, MNIST, LoRA on a 7B model |
| Lab | 8 GPUs, days | Medium models, individual research, small foundation models |
| Industry | 100+ GPUs, weeks | LLM fine-tuning at production scale, vision foundation models |
| Frontier | 10,000+ GPUs, months | GPT-4, Claude, Gemini, DeepSeek-V3 scale |
At the personal scale, almost any reasonable optimiser works; you can afford grid searches over learning rate; mixed precision is a convenience rather than a necessity; failure is cheap. Beginners often over-engineer here, importing distributed-training apparatus they do not need.
At the lab scale, hyperparameter sensitivity begins to bite. Data parallelism across eight GPUs makes the effective batch size large enough that the linear scaling rule (section 10.9) starts to matter. Mixed precision becomes essential to fit reasonable models. You begin to care about the difference between Adam and AdamW, because weight decay interacts non-trivially with adaptive scaling.
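The Adam-versus-AdamW distinction is easiest to see written out. With plain L2 regularisation the decay term is folded into the gradient and therefore passes through the adaptive rescaling; AdamW applies it directly to the weights. The sketch below omits bias correction, and the hyperparameter defaults are illustrative:

```python
import torch

def adam_l2_step(theta, grad, m, v, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # L2 regularisation: decay enters the gradient, so it is rescaled by the
    # adaptive denominator and its effective strength varies per weight.
    g = grad + wd * theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    return theta - lr * m / (v.sqrt() + eps), m, v

def adamw_step(theta, grad, m, v, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Decoupled weight decay: applied straight to the weights, untouched by
    # the adaptive rescaling.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    return theta - lr * (m / (v.sqrt() + eps) + wd * theta), m, v
```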
At the industry scale, runs are expensive enough that you cannot afford a from-scratch hyperparameter sweep; you transfer settings from a smaller proxy run, perhaps with $\mu$P-style scaling rules. ZeRO and FSDP become standard. A single bad setting can waste weeks of GPU time, so the team writes regression tests for the loss curve. Failure recovery (checkpoint-and-resume, deterministic data ordering) is engineered carefully.
At the frontier scale, training is a logistical exercise as much as a scientific one. The optimisation algorithm is a fixed input to a system whose other variables (power, cooling, networking, hardware failure rates, supply-chain delivery of accelerators) dominate the schedule. The learning-rate schedule is chosen for robustness under restart, not for last-mile convergence. Gradient norms are watched on dashboards by engineers paged in the middle of the night. The optimisation theory of this chapter is necessary background, but the binding constraint at this scale is rarely a clever new optimiser; it is the fact that a hardware fault three weeks into a run can erase a fortnight of progress unless your checkpointing strategy is sound.
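A minimal sketch of the checkpoint-and-resume discipline the last two paragraphs lean on, again in PyTorch. The filename pattern and save interval are arbitrary; production systems add sharded and asynchronous saving, data-loader position and RNG state on top of this:

```python
import torch

def save_checkpoint(path, step, model, opt, sched):
    # Enough to resume the optimisation state; real systems also record the
    # data-loader position and RNG state for exact reproducibility.
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": opt.state_dict(),
        "scheduler": sched.state_dict(),
    }, path)

def load_checkpoint(path, model, opt, sched):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["optimizer"])
    sched.load_state_dict(ckpt["scheduler"])
    return ckpt["step"]                    # step to resume the loop from

# Inside the training loop (sketch):
# if step % 1_000 == 0:
#     save_checkpoint(f"ckpt_{step:07d}.pt", step, model, opt, sched)
```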
What you should take away
- Deep learning training is a single loop (sample, forward, backward, update) repeated $10^5$ to $10^7$ times, and every advance you read about is a refinement of one of those four steps, not a replacement of the loop itself.
- The optimisation problem is formally intractable (non-convex, high-dimensional, stochastic), but in practice tractable because overparameterisation flattens the landscape, most local minima are good, and SGD's noise escapes saddle points.
- The bottleneck shifts with scale: at small scale, almost any optimiser works; at lab scale, hyperparameter sensitivity dominates; at industry and frontier scale, memory, bandwidth and fault tolerance bind long before the optimiser does.
- Modern practice (AdamW, cosine-with-warmup, mixed precision, gradient clipping, FSDP/ZeRO) is a coherent response to specific obstacles, not an arbitrary stack of tricks. The remaining sections of this chapter explain each obstacle and the technique that addresses it.
- The mathematical core has barely changed since the 1990s; the engineering has moved by six orders of magnitude. Knowing the difference between the two (what is theory, what is practice) is what separates someone who can read a training recipe from someone who can write one.