6.5 Optimisation in more depth
Most of machine learning, once the modelling decisions have been made, reduces to a single sentence: choose the parameters that minimise an empirical risk. Everything that follows, from a logistic regression fitted in milliseconds to a frontier language model trained for months across thousands of accelerators, is a variation on that one optimisation problem. The choice of optimiser is therefore not a peripheral implementation detail. It determines whether training converges at all, how quickly, to which point in parameter space, and, perhaps surprisingly, how well the resulting model generalises to data it has never seen.
This section covers the optimisation theory specific to machine learning. We separate the convex world, where classical guarantees apply and any local minimum is global, from the non-convex landscape of deep networks, where the picture is messier and the empirical track record is, against the odds, excellent. We then sketch the convergence rates that pin these regimes down, survey the adaptive optimisers that dominate large-scale training, and close on implicit regularisation, the now-famous observation that stochastic gradient descent does not merely minimise the loss but biases the search towards solutions that generalise. §6.4 examined the generalisation half of the picture; §6.5 is its optimisation counterpart, and both threads come together in Chapters 9 and 10 when we train deep networks in earnest. The calculus in Chapter 3 supplies the gradients we will be following.
Convex vs non-convex
A function $\mathcal{L}(\theta)$ is convex when the line segment between any two points on its graph lies on or above the graph itself. Geometrically, the surface is bowl-shaped: there is a single basin, no ridges, no false floors. Algebraically, the Hessian is positive semi-definite everywhere it exists. The consequence for optimisation is profound. Every local minimum of a convex loss is also a global minimum, every stationary point with $\nabla \mathcal{L}(\theta) = 0$ is optimal, and gradient descent with a sensible step size is guaranteed to find that optimum. We can stop training when the gradient norm falls below a tolerance and be confident that we have not been trapped somewhere disappointing.
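To make the Hessian criterion concrete, here is a minimal numerical check (an illustrative NumPy sketch, not part of any library): the Hessian of the mean cross-entropy loss of logistic regression is $(1/n)\,X^\top \mathrm{diag}(p_i(1-p_i))\,X$, which is positive semi-definite at every $\theta$, and we can confirm as much by probing its eigenvalues at random points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # a small random design matrix
n, d = X.shape

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(theta):
    # Hessian of the mean cross-entropy loss: (1/n) X^T diag(p(1-p)) X.
    # Note that it does not depend on the labels at all.
    p = sigmoid(X @ theta)
    return (X.T * (p * (1 - p))) @ X / n

# Positive semi-definite at every point we probe, consistent with convexity.
for _ in range(5):
    eigvals = np.linalg.eigvalsh(hessian(rng.normal(size=d)))
    assert eigvals.min() >= -1e-10
print("Hessian PSD at all sampled points")
```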
This is not a niche regime. Linear regression with squared error, logistic regression with cross-entropy, ridge and lasso regression, support vector machines with hinge loss, Poisson and other generalised linear models: the entire classical statistical toolkit is convex by construction. Convex optimisation is also a mature field with off-the-shelf interior-point and proximal solvers that exploit problem structure to converge in tens of iterations. If you can cast a problem as convex, you should: the practical and theoretical advantages are overwhelming.
Neural networks are different. The composition of linear maps and non-linear activations produces a loss surface that is genuinely non-convex. There are many local minima, more saddle points than minima in high dimensions, long flat plateaus, narrow ravines, and regions where the gradient vanishes for reasons that have nothing to do with optimality. Worse, the loss is invariant under permutations of hidden units and under sign flips on layers paired with compensating changes downstream, so the same function is realised at millions of distinct points in parameter space, and global minima exist in their millions. And in general, finding the global minimum of a non-convex loss is NP-hard.
In practice, this nightmare has not materialised. Empirically, gradient descent on neural-network losses finds solutions that are not necessarily global but are good enough, and, more strikingly, the local minima reached by stochastic gradient descent tend to generalise comparably regardless of the random seed. Several explanations have been advanced. Saddle points, not poor local minima, are the dominant obstacle in high dimensions, and small amounts of stochastic noise dislodge gradient flow from saddles efficiently. Most local minima of an over-parameterised network achieve near-zero training loss and lie on a connected, low-loss manifold rather than in isolated wells. The training objective is benign in ways that worst-case theory does not capture.
The practical lesson is to take convexity seriously when it is available (convex problems are solved problems) and to accept, when it is not, that we are doing applied alchemy with surprisingly reliable results.
Stochastic gradient descent
Computing $\nabla \mathcal{L}(\theta)$ over a training set with $n$ examples costs $O(n)$ per step. For $n$ in the millions and parameter counts in the billions, full-batch gradient descent is simply infeasible: a single step would take longer than we are prepared to wait, and we want thousands of steps. Stochastic gradient descent (SGD) sidesteps the problem by sampling a mini-batch $\mathcal{B}_t$ of $B$ examples uniformly at random and computing the gradient on that subsample only: $$ \theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}_{\mathcal{B}_t}(\theta_t). $$ The mini-batch gradient is an unbiased estimator of the true gradient, its expectation is exactly $\nabla \mathcal{L}(\theta_t)$, but it is noisy, with variance scaling like $\sigma^2 / B$ for a problem-dependent constant $\sigma^2$. The computational saving is dramatic: each step costs a factor of $n/B$ less, so we can afford that many more steps in the same wall-clock budget.
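A minimal mini-batch SGD loop makes the update concrete. The sketch below (illustrative code on a synthetic least-squares problem; the constants are arbitrary) implements exactly the update above and recovers the generating parameters to within the noise floor.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, B, eta = 10_000, 5, 64, 0.1
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)   # noisy linear data

theta = np.zeros(d)
for t in range(2_000):
    idx = rng.choice(n, size=B, replace=False)              # uniform mini-batch
    grad = (2 / B) * X[idx].T @ (X[idx] @ theta - y[idx])   # unbiased estimate
    theta -= eta * grad                                     # the SGD update

print(np.linalg.norm(theta - theta_true))       # small: near the optimum
```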
The sampling noise is not a bug. It is, on non-convex losses, very nearly a feature. Stochastic perturbations let the trajectory escape saddle points and sharp local minima that would trap deterministic gradient flow. The implicit-regularisation argument of Keskar and colleagues, returned to below, makes the case that SGD's noise actively biases the optimiser towards flat minima that generalise better. None of this would be useful if SGD did not also converge, and we will see in the next subsection that it does, slower than the full-batch method, but with vastly cheaper steps.
The choice of batch size $B$ is a practical lever. Small batches (say $B = 32$) maximise the noise and the implicit regularisation, but underuse modern accelerators. Large batches ($B = 4096$ or more, common in pre-training) saturate hardware throughput but sometimes generalise worse without careful learning-rate scaling. The linear scaling rule, $\eta \propto B$, restores comparable training dynamics up to a critical batch size beyond which increasing $B$ stops improving wall-clock time. Mini-batch SGD is the workhorse: it is the first algorithm any deep-learning practitioner reaches for, and many of the more elaborate methods discussed below are layered on top of it.
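In code, the linear scaling rule is one line. The helper below is a sketch with hypothetical reference values (a base rate of 0.1 at batch size 256, and a critical batch size of 8192); real values are found by experiment.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256, critical_batch=8192):
    """Linear scaling rule: grow the learning rate in proportion to the
    batch size, but stop scaling past the critical batch size, beyond
    which larger batches no longer buy faster training."""
    return base_lr * min(batch_size, critical_batch) / base_batch
```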
A subtler consequence of mini-batching is that we no longer minimise the empirical risk exactly; we run a stochastic process whose stationary distribution depends on $\eta$, $B$, and the data. The point we converge to is not the minimiser of $\mathcal{L}$ but a noisy attractor near it, and the size of that attractor scales with $\eta / B$. This is one reason learning-rate schedules matter so much: cooling $\eta$ near the end of training shrinks the noise ball and lets the parameters settle.
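The effect is easy to see on a toy problem. The sketch below (a one-dimensional quadratic with artificial gradient noise, purely illustrative) shows a constant learning rate hovering in its noise ball while a decayed one settles much closer to the optimum.

```python
import numpy as np

rng = np.random.default_rng(2)

def sgd_final_distance(lr_schedule, steps=5_000, noise=1.0):
    """SGD on L(theta) = theta^2 / 2 with additive gradient noise;
    returns the final distance from the optimum at zero."""
    theta = 5.0
    for t in range(steps):
        grad = theta + noise * rng.normal()   # noisy gradient estimate
        theta -= lr_schedule(t) * grad
    return abs(theta)

print(sgd_final_distance(lambda t: 0.1))                   # stuck in the noise ball
print(sgd_final_distance(lambda t: 0.1 / (1 + 0.01 * t)))  # cooled eta settles closer
```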
Convergence rates
How fast does gradient descent reach the optimum? The answer depends on the geometry of the loss. Three regimes deserve to be on a single sheet.
| Loss class | Algorithm | Rate | What "rate" measures |
|---|---|---|---|
| Convex, Lipschitz gradient | Full-batch GD | $\mathcal{L}(\theta_T) - \mathcal{L}^* = O(1/T)$ | Sub-optimality of the loss |
| Strongly convex | Full-batch GD | $\mathcal{L}(\theta_T) - \mathcal{L}^* = O(c^T)$, $c < 1$ | Geometric convergence |
| Convex | SGD (decaying $\eta$) | $\mathbb{E}[\mathcal{L}(\theta_T)] - \mathcal{L}^* = O(1/\sqrt{T})$ | Sub-optimality in expectation |
| Strongly convex | SGD | $\mathbb{E}[\mathcal{L}(\theta_T)] - \mathcal{L}^* = O(1/T)$ | Sub-optimality in expectation |
| Non-convex, smooth | SGD | $\min_{t \le T} \mathbb{E}[\|\nabla \mathcal{L}(\theta_t)\|^2] = O(1/\sqrt{T})$ | Approach to a stationary point |
A few features of the table deserve comment. For convex Lipschitz problems, full-batch gradient descent reduces sub-optimality at rate $1/T$, while SGD manages only $1/\sqrt{T}$, a real cost in step count. But SGD's per-step cost is $n/B$ times lower, so in terms of examples processed, SGD is essentially never worse and is usually better. Strong convexity, meaning the loss has curvature bounded below by a positive constant $\mu$, buys exponential convergence for full-batch GD and the faster $1/T$ rate for SGD. Ridge regression is strongly convex; ordinary least squares typically is not unless the design matrix has full column rank.
The non-convex row is the one that matters for deep learning, and it is also the most modest. SGD on a smooth non-convex loss is only guaranteed to reach a stationary point, somewhere with small gradient norm, and it does so at rate $1/\sqrt{T}$. There is no guarantee that the stationary point is a minimum, let alone a global one. This is not a defect of the analysis; it is the price of dropping convexity. Stronger guarantees require additional structure (the Polyak–Łojasiewicz inequality, for instance, recovers linear convergence) or specific results about over-parameterised networks (Du, Allen-Zhu and colleagues showed that wide enough networks under SGD do converge to global optima with high probability). The take-home point is that the gap between convex and non-convex theory is real: convex problems come with rates and certificates; non-convex ones come with asymptotic statements about gradient norms and a great deal of empirical evidence that things turn out fine.
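The first two convex rows of the table can be watched directly. The sketch below (synthetic least squares, arbitrary constants, illustrative only) runs full-batch GD, whose sub-optimality collapses geometrically, against mini-batch SGD with a $1/t$ learning rate, whose gap shrinks roughly tenfold for every tenfold increase in steps, as the strongly convex $O(1/T)$ row predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1_000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

def gap(theta):   # sub-optimality of the mean squared error
    return np.mean((X @ theta - y) ** 2) - np.mean((X @ theta_star - y) ** 2)

# Full-batch GD on a strongly convex quadratic: geometric decay.
theta = np.zeros(d)
for t in range(1, 101):
    theta -= 0.1 * (2 / n) * X.T @ (X @ theta - y)
    if t in (50, 100):
        print(f"GD   T={t:6d}  gap={gap(theta):.2e}")

# Mini-batch SGD with a decaying step: the gap shrinks roughly like 1/T.
theta, B = np.zeros(d), 32
for t in range(1, 10_001):
    idx = rng.integers(0, n, size=B)
    theta -= (1 / (t + 50)) * (2 / B) * X[idx].T @ (X[idx] @ theta - y[idx])
    if t in (1_000, 10_000):
        print(f"SGD  T={t:6d}  gap={gap(theta):.2e}")
```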
Adaptive optimisers
Plain SGD uses a single learning rate for every parameter. But the loss surface is rarely so co-operative: some directions are steep and demand small steps, others are shallow and could tolerate enormous ones. Adaptive optimisers maintain per-parameter learning rates that adjust during training based on the history of gradients each parameter has seen. The family started with AdaGrad, which divides each gradient component by the square root of its cumulative squared history: coordinates that have received large gradients in the past get scaled-down updates, while rarely updated coordinates keep large effective learning rates. AdaGrad works well on sparse problems but, because the cumulative sum only ever grows, the effective learning rate decays to zero and training stalls.
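In update form, AdaGrad is a few lines. The sketch below is illustrative (the function name and defaults are ours, not a library API).

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad update. Each coordinate is divided by the root of its
    cumulative squared-gradient history; `accum` only ever grows, which
    is exactly why the effective learning rate eventually decays to zero."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```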
RMSProp fixed this by replacing the cumulative sum with an exponential moving average, so that distant history is forgotten. Adam, introduced by Kingma and Ba in 2015, combined RMSProp's adaptive denominator with momentum's smoothed numerator and added bias-correction terms that compensate for the zero-initialised moment estimates in the earliest steps. The result is the de facto default optimiser of modern deep learning. AdamW decouples weight decay from the gradient update, which matters because the adaptive denominator would otherwise distort the regularisation. Adam's defaults, $\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}$, work on most problems out of the box.
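The full update fits in a dozen lines. The sketch below is again illustrative rather than a drop-in optimiser: it implements one AdamW step with bias correction, and setting `weight_decay=0` recovers plain Adam.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad          # smoothed first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # smoothed second moment
    m_hat = m / (1 - beta1 ** t)                # bias corrections undo the
    v_hat = v / (1 - beta2 ** t)                # zero initialisation of m, v
    # Weight decay is applied to the parameters directly, decoupled from
    # the adaptive rescaling of the gradient.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```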
The case for adaptive methods is largely empirical. They train Transformers and large language models robustly across a wide range of hyperparameters, where SGD with momentum often diverges or stalls. The case against is also partly empirical: in many computer-vision settings, well-tuned SGD with momentum achieves better final accuracy than Adam, particularly when strong weight decay and careful learning-rate schedules are involved. The community has tried to explain these regularities (sign-SGD interpretations of Adam, scale-invariance arguments, gradient-noise considerations) without fully settling the matter. Choose by experiment, not by reflex. We return to optimiser choice in Chapter 10 and to the underlying calculus in §3.9.
Implicit regularisation
The most surprising fact about SGD is one that took the field roughly a decade to articulate clearly. SGD does not merely minimise the empirical loss; it has an implicit bias towards solutions that generalise. Several distinct phenomena travel under this banner. On linearly separable data, SGD on logistic regression converges in direction to the maximum-margin classifier, even though no max-margin term appears in the loss; Soudry, Hoffer and colleagues proved this in 2018. On over-parameterised linear regression with multiple zero-loss solutions, SGD initialised at zero selects the minimum-norm solution. On deep networks, the empirical observation is that small-batch SGD tends to find flat minima, regions of parameter space where the loss varies slowly with $\theta$, while large-batch methods find sharper minima that train equally well but generalise worse.
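The minimum-norm claim can be checked in a few lines. In the sketch below (synthetic and illustrative), gradient descent from a zero initialisation on an over-parameterised least-squares problem lands on the same solution as the Moore–Penrose pseudoinverse, which is exactly the minimum-norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 100                                  # over-parameterised: d >> n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent from zero stays in the row space of X, so among the
# infinitely many zero-loss solutions it converges to the one with the
# minimum Euclidean norm.
theta = np.zeros(d)
for _ in range(20_000):
    theta -= 0.01 * X.T @ (X @ theta - y) / n

theta_min_norm = np.linalg.pinv(X) @ y          # closed-form min-norm solution
print(np.linalg.norm(theta - theta_min_norm))   # essentially zero
```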
The mechanisms are partially understood. SGD's update can be analysed as discretising a stochastic differential equation whose drift is the negative gradient and whose diffusion scales with $\eta / B$. Higher diffusion means the trajectory cannot settle into narrow wells, only into wide basins, and wide basins correspond to flat minima, which, by an information-theoretic argument originating with Hochreiter and Schmidhuber, generalise better because they describe the data with fewer effective bits. The argument is subtle (sharpness is not invariant under reparameterisation, so "flat" depends on the coordinates you measure it in) but the empirical signal is robust enough that practitioners design schedules around it: large initial $\eta$ to find a good basin, small final $\eta$ to settle into it.
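Flatness itself can be probed crudely. The helper below is a sketch (our own naming, and only one of several non-equivalent sharpness measures): it estimates the average loss increase under small Gaussian perturbations of the parameters, which is small at a flat minimum and large at a sharp one.

```python
import numpy as np

def sharpness(loss_fn, theta, sigma=0.01, trials=100, rng=None):
    """Average loss increase under isotropic Gaussian parameter noise.
    As the text warns, this measure is not invariant under
    reparameterisation: rescaling the weights changes the number."""
    rng = rng or np.random.default_rng()
    base = loss_fn(theta)
    bumps = [loss_fn(theta + sigma * rng.normal(size=theta.shape))
             for _ in range(trials)]
    return float(np.mean(bumps) - base)
```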
Implicit regularisation has consequences beyond SGD itself. It blunts the "more parameters always overfits" intuition from §6.4, helping to explain why over-parameterised networks generalise at all. It makes the choice of optimiser part of the modelling decision, because Adam and SGD do not bias solutions in the same way. And it points ahead to §6.13, where we treat implicit regularisation as a first-class topic. For now, the point is the modesty of what we can prove and the strength of what we observe: the optimiser does not just find a model; in a meaningful sense, it chooses one.
What you should take away
- Most of machine learning is empirical risk minimisation. Optimiser choice shapes which member of $\mathcal{H}$ you end up with, and that choice is part of the modelling decision.
- Convex losses (linear and logistic regression, SVMs) come with global guarantees and mature solvers. Non-convex losses (neural networks) lack guarantees but work in practice for reasons we partly understand.
- SGD trades step quality for step cost: $1/\sqrt{T}$ convergence in the convex case, but $n/B$ times cheaper steps. On non-convex losses it converges only to stationary points, and that is enough.
- Adaptive optimisers (Adam, AdamW, RMSProp) maintain per-parameter learning rates and dominate large-scale language-model training; well-tuned SGD with momentum still wins many vision benchmarks.
- SGD is implicitly biased towards flat, low-norm, well-generalising solutions. The optimiser does not just minimise the loss; it chooses among many minimisers, and that choice is what makes deep learning work.