Chapter Ten

Training & Optimisation

Learning Objectives
  1. Explain stochastic gradient descent and its mini-batch variant, including the role of batch size
  2. Compare adaptive optimisers (Adam, RMSProp, AdaGrad) to vanilla SGD and describe when each is preferred
  3. Apply batch normalisation and related normalisation layers to stabilise deep network training
  4. Use regularisation techniques (dropout, weight decay, data augmentation) to reduce overfitting
  5. Design learning rate schedules (warmup, cosine decay, step decay) and tune hyperparameters systematically

You have built a neural network with millions, perhaps billions, of parameters. Now you need those parameters to take on values that make good predictions, not just on the training set, but on data the model has never seen. That is the training problem, and it sits at the heart of every successful application of deep learning. The architecture matters; the loss function matters; the data matters. But unless you can reliably reduce the loss in a way that generalises, none of the other choices help.

The chapter that follows is unusually long because the topic is unusually deep. The optimisation problem we face, minimising a non-convex loss over an extremely high-dimensional parameter space, sits at the intersection of three rich theoretical traditions: classical convex analysis (Polyak, Nesterov, Bertsekas), stochastic approximation (Robbins, Monro, Kushner) and the empirical practice that has grown up around modern deep networks since the late 2000s. We must give each its due.

We begin by characterising the loss landscape itself: where the minima are, what saddle points look like, and why a noisy first-order method can navigate this terrain at all. We then derive gradient descent and prove a convergence rate for the convex case, before passing to stochastic gradient descent and its variants: momentum, Nesterov acceleration, AdaGrad, RMSProp, Adam and AdamW. After the optimisers we turn to the system-level concerns that dominate engineering: learning rate schedules, batch size scaling, gradient clipping, mixed-precision arithmetic, and the various flavours of distributed training (DDP, FSDP, ZeRO 1/2/3, pipeline and tensor parallelism). We close with the modern theoretical view of overparameterisation, the double-descent phenomenon and implicit regularisation, followed by hyperparameter search and a practical guide to debugging training. The chapter ends with a complete, runnable training loop in PyTorch that ties these ideas together.

By the end you should be able to write down the update rule for any of the major optimisers from memory, know when each is preferred, design a learning rate schedule for a new model, diagnose pathological training runs from a brief look at the loss curve, and read a modern training paper without confusion.

