10.18 Connecting back to the rest of the book
We have come a long way from the simple update rule $\theta \leftarrow \theta - \eta \nabla L$. The training problem in deep learning is a tour of applied mathematics: convex analysis (Polyak, Nesterov), stochastic approximation (Robbins, Monro), numerical linear algebra (mixed precision, Shampoo's Kronecker factors), distributed systems (DDP, FSDP, NCCL all-reduce), and a generous helping of empirical engineering.
The next chapter, on convolutional neural networks (Chapter 11), takes everything here for granted. When the ResNet paper says "trained for 600k steps with SGD, momentum 0.9, weight decay $10^{-4}$, learning rate 0.1 dropped by 10× at steps 300k and 500k, batch size 256", you now know what every word means and could re-implement the recipe (a sketch follows below). When the GPT-3 paper says "AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1, cosine schedule with 375M-token warmup, gradient clipping 1.0, 3.2M-token batch size, mixed-precision FP16", the same is true.
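As a rough translation of those two recipes, here is a minimal PyTorch sketch. The model objects are stand-ins, `warmup_steps` and `total_steps` are placeholders (the GPT-3 figures are token counts that would have to be converted into optimiser steps for a real run), and the mixed-precision and distributed pieces are omitted; this is an illustration of the hyperparameters, not a reproduction of either training run.

```python
import math
import torch
from torch.optim import SGD, AdamW
from torch.optim.lr_scheduler import MultiStepLR, LambdaLR

# Stand-in modules; in practice these would be a ResNet and a transformer.
resnet = torch.nn.Linear(512, 1000)
transformer = torch.nn.Linear(512, 512)

# --- ResNet-style recipe: SGD with momentum, step decay at fixed iterations ---
sgd = SGD(resnet.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 at steps 300k and 500k (scheduler stepped once per iteration).
resnet_sched = MultiStepLR(sgd, milestones=[300_000, 500_000], gamma=0.1)

# --- GPT-3-style recipe: AdamW, linear warmup then cosine decay, gradient clipping ---
adamw = AdamW(transformer.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps = 1_000    # placeholder: 375M warmup tokens / tokens per batch in the real run
total_steps = 100_000   # placeholder for the full schedule length

def warmup_cosine(step: int) -> float:
    """Multiplicative LR factor: linear warmup, then cosine decay towards zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

gpt_sched = LambdaLR(adamw, lr_lambda=warmup_cosine)

# Inside the training loop, clipping comes between backward() and the optimiser step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(transformer.parameters(), max_norm=1.0)
#   adamw.step(); gpt_sched.step(); adamw.zero_grad()
```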
The frontier of the field continues to move. Optimisers like SOAP and Shampoo report step-count savings of 30–50% over AdamW. Hardware-aware schedules (e.g. with longer warmup for ZeRO-3 to mask reduce-scatter latency) appear in newer training reports. Adaptive batch sizes that grow during training are an active research area. But the core principles are stable: first-order methods on noisy gradients, learning rate schedules, regularisation by noise and weight decay, and distributed memory partitioning. They are what the rest of this book assumes.