10.2 Gradient descent

The simplest method imaginable. At each step,

$$\theta_{t+1} = \theta_t - \eta\, \nabla L(\theta_t),$$

where $\eta > 0$ is the learning rate (or step size). This is the forward-Euler discretisation of the continuous gradient flow $\dot\theta = -\nabla L(\theta)$ with step size $\eta$.
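As a minimal sketch in code (assuming JAX for the gradient; the quadratic loss, starting point, and step count are illustrative placeholders):

```python
import jax
import jax.numpy as jnp

def gradient_descent(loss, theta0, eta=0.1, steps=100):
    """Plain gradient descent: theta <- theta - eta * grad L(theta)."""
    grad = jax.grad(loss)                  # reverse-mode autodiff gives nabla L
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Illustrative convex quadratic with minimiser at the origin.
loss = lambda th: 0.5 * jnp.sum(th ** 2)
print(gradient_descent(loss, jnp.array([3.0, -2.0])))  # approaches [0., 0.]
```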

Why downhill?

By Taylor's theorem, for small $\eta$,

$$L(\theta - \eta g) = L(\theta) - \eta\, g^\top g + \tfrac{1}{2} \eta^2\, g^\top H\, g + O(\eta^3),$$

where $g = \nabla L(\theta)$ and $H = \nabla^2 L(\theta)$. The first-order term $-\eta \|g\|^2$ is strictly negative whenever $g \ne 0$, so for sufficiently small $\eta$, gradient descent decreases the loss. The second-order term is what limits how aggressive we can be: if $\eta$ is too large, the quadratic correction can swamp the linear decrease and we overshoot.
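A quick numerical check of this overshoot effect, on an illustrative 1-D quadratic with curvature $\beta = 10$ (so $\eta = 2/\beta = 0.2$ is the stability boundary):

```python
import jax
import jax.numpy as jnp

beta = 10.0
loss = lambda th: 0.5 * beta * th ** 2     # 1-D quadratic; the Hessian is just beta
theta = jnp.array(1.0)
g = jax.grad(loss)(theta)                  # = beta * theta = 10

print("before:", float(loss(theta)))       # 5.0
for eta in [0.01, 0.1, 0.19, 0.25]:        # 0.1 = 1/beta; 0.2 = 2/beta
    print(eta, float(loss(theta - eta * g)))
# The loss falls for eta < 2/beta, hits zero at eta = 1/beta,
# and exceeds its starting value once eta > 2/beta: overshoot.
```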

Smoothness and the maximum step size

We say $L$ is $\beta$-smooth if its gradient is $\beta$-Lipschitz: for all $u, v$,

$$\|\nabla L(u) - \nabla L(v)\| \le \beta \|u - v\|.$$

Equivalently, if $L$ is twice differentiable, $\|\nabla^2 L(\theta)\| \le \beta$ (in spectral norm) for all $\theta$. The descent lemma says that for $\beta$-smooth $L$,

$$L(v) \le L(u) + \nabla L(u)^\top (v - u) + \frac{\beta}{2} \|v - u\|^2.$$

Plugging in $v = u - \eta g$:

$$L(u - \eta g) \le L(u) - \eta \|g\|^2 + \frac{\beta \eta^2}{2} \|g\|^2 = L(u) - \eta\left(1 - \frac{\beta\eta}{2}\right)\|g\|^2.$$

The right-hand side is minimised at $\eta = 1/\beta$, giving

$$L(\theta_{t+1}) \le L(\theta_t) - \frac{1}{2\beta} \|\nabla L(\theta_t)\|^2.$$

This single inequality drives almost all convergence proofs for gradient descent.
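A sanity check of the inequality along a trajectory, on an illustrative 2-D quadratic whose largest Hessian eigenvalue is $\beta = 10$:

```python
import jax
import jax.numpy as jnp

A = jnp.diag(jnp.array([10.0, 1.0]))       # Hessian; beta = 10 (largest eigenvalue)
beta = 10.0
loss = lambda th: 0.5 * th @ A @ th
grad = jax.grad(loss)

theta = jnp.array([1.0, 1.0])
for t in range(5):
    g = grad(theta)
    theta_next = theta - g / beta          # eta = 1/beta
    lhs = float(loss(theta_next))
    rhs = float(loss(theta) - (g @ g) / (2 * beta))
    print(f"step {t}: L(next) = {lhs:.6f} <= bound {rhs:.6f}")
    theta = theta_next
```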

Convergence in the convex case

Suppose $L$ is convex and $\beta$-smooth, with minimiser $\theta^\star$. Telescoping the descent inequality and using convexity gives

$$L(\theta_T) - L(\theta^\star) \le \frac{\beta \|\theta_0 - \theta^\star\|^2}{2T}.$$

The convergence rate is $O(1/T)$. Halving the gap requires doubling the iterations.
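Where does this bound come from? One standard route, for $\eta = 1/\beta$: combining the descent lemma with convexity, $L(\theta_t) - L(\theta^\star) \le \nabla L(\theta_t)^\top (\theta_t - \theta^\star)$, yields the per-step bound

$$L(\theta_{t+1}) - L(\theta^\star) \le \frac{\beta}{2}\left(\|\theta_t - \theta^\star\|^2 - \|\theta_{t+1} - \theta^\star\|^2\right).$$

Summing over $t = 0, \dots, T-1$ telescopes the right-hand side to at most $\tfrac{\beta}{2}\|\theta_0 - \theta^\star\|^2$, and since the descent lemma makes the loss values non-increasing, $T\,(L(\theta_T) - L(\theta^\star)) \le \sum_{t=0}^{T-1} \left(L(\theta_{t+1}) - L(\theta^\star)\right)$, which rearranges to the stated bound.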

Strong convexity and exponential convergence

We say $L$ is $\alpha$-strongly convex if for all $u, v$,

$$L(v) \ge L(u) + \nabla L(u)^\top (v - u) + \frac{\alpha}{2}\|v - u\|^2.$$

Equivalently, for twice-differentiable $L$, $\nabla^2 L(\theta) \succeq \alpha I$. Strong convexity rules out flat directions in the loss surface. For an $\alpha$-strongly convex, $\beta$-smooth function with $\eta = 1/\beta$, gradient descent has linear convergence:

$$\|\theta_t - \theta^\star\|^2 \le \left(1 - \frac{\alpha}{\beta}\right)^t \|\theta_0 - \theta^\star\|^2.$$

The ratio $\kappa = \beta/\alpha$ is the condition number. Linear convergence means we halve the error every $O(\kappa)$ iterations, in contrast to the slow $O(1/T)$ rate of merely-convex problems.

A worked example: the quadratic

Let $L(\theta) = \tfrac{1}{2} \theta^\top A \theta$ with $A$ symmetric positive definite, eigenvalues in $[\alpha, \beta]$. Gradient descent with step $\eta$ gives $\theta_{t+1} = (I - \eta A)\theta_t$. The error contracts by a factor $\max_i |1 - \eta \lambda_i|$ per step. Setting $\eta = 2/(\alpha + \beta)$ minimises this and yields contraction $\frac{\kappa - 1}{\kappa + 1}$. So a problem with condition number $\kappa = 100$ contracts by a factor of about $0.98$ per step, and reducing the error by two decimal places (a factor of $10^{-2}$) takes roughly $T \approx \tfrac{\kappa}{2}\ln 10^2 \approx 230$ steps.
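A short check of the contraction factor; the matrix $A$ below is an illustrative diagonal with eigenvalues $1$ and $100$, so $\kappa = 100$ and the predicted contraction is $99/101 \approx 0.9802$:

```python
import jax.numpy as jnp

alpha, beta_ = 1.0, 100.0
A = jnp.diag(jnp.array([alpha, beta_]))     # eigenvalues [alpha, beta]
eta = 2.0 / (alpha + beta_)                 # optimal fixed step size
theta = jnp.array([1.0, 1.0])

prev = float(jnp.linalg.norm(theta))
for t in range(5):
    theta = theta - eta * (A @ theta)       # GD step on L = 0.5 th^T A th
    cur = float(jnp.linalg.norm(theta))
    print(f"step {t}: contraction {cur / prev:.4f}")   # -> 0.9802 each step
    prev = cur
```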

For deep networks, neither convexity nor strong convexity holds. But the analytical machinery (descent lemma, smoothness, condition number) still informs every step-size choice we make. A learning rate that is "too high" is one that violates the descent lemma; a learning rate that is "too low" is one that wastes the smoothness budget.

Non-convex convergence

Without convexity we cannot prove that $L(\theta_T)$ approaches the global minimum value $L^\star$. We can, however, prove that the gradient norm goes to zero. From the descent inequality and telescoping,

$$\sum_{t=0}^{T-1} \|\nabla L(\theta_t)\|^2 \le 2\beta\, (L(\theta_0) - L^\star),$$

so, since the minimum over $t$ is at most the average, $\min_{t < T} \|\nabla L(\theta_t)\|^2 \le \frac{2\beta\,(L(\theta_0) - L^\star)}{T} = O(1/T)$; that is, the best gradient norm falls at rate $O(1/\sqrt{T})$. We approach a stationary point, but not necessarily a minimum.
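A numerical check of the telescoped bound, on an illustrative non-convex 1-D function whose second derivative is bounded by $\beta = 15.5$:

```python
import jax
import jax.numpy as jnp

# Non-convex test function: f'' = 2 - 13.5 sin(3 th), so |f''| <= 15.5 =: beta.
loss = lambda th: th ** 2 + 1.5 * jnp.sin(3.0 * th)
beta = 15.5
grad = jax.grad(loss)

theta = jnp.array(2.0)
L0 = float(loss(theta))
sq_grad_sum, min_sq_grad, T = 0.0, float("inf"), 200
for t in range(T):
    g = float(grad(theta))
    sq_grad_sum += g ** 2
    min_sq_grad = min(min_sq_grad, g ** 2)
    theta = theta - g / beta                       # eta = 1/beta
# Telescoped descent inequality: sum of ||g||^2 <= 2 beta (L(theta_0) - L(theta_T)).
print(sq_grad_sum, "<=", 2 * beta * (L0 - float(loss(theta))))
print("min grad^2 over T steps:", min_sq_grad)     # decays like O(1/T)
```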

Beyond first order: why we don't use Newton's method

A natural reaction to the slow convergence of first-order methods on ill-conditioned problems is to ask: why not use second-order information? Newton's method updates as

$$\theta_{t+1} = \theta_t - \eta\, [\nabla^2 L(\theta_t)]^{-1}\, \nabla L(\theta_t),$$

and converges quadratically near a minimum. The catch is the Hessian. For a network with $d = 10^9$ parameters the Hessian has $10^{18}$ entries; storing it is infeasible, and inverting it even more so. Even Hessian-vector products $H v$, which can be computed in $O(d)$ time using an extra back-pass through the graph, become too expensive when the linear system $H \delta = g$ may need hundreds of conjugate-gradient iterations.
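A minimal sketch of such a Hessian-vector product in JAX, using forward-over-reverse differentiation (the quadratic test function is illustrative):

```python
import jax
import jax.numpy as jnp

def hvp(loss, theta, v):
    """Compute H v without ever forming H: differentiate grad(loss) along v.
    One extra pass through the graph, O(d) time and memory."""
    return jax.jvp(jax.grad(loss), (theta,), (v,))[1]

# Sanity check on a quadratic, where the Hessian is exactly A.
A = jnp.array([[3.0, 1.0], [1.0, 2.0]])
loss = lambda th: 0.5 * th @ A @ th
theta = jnp.array([1.0, -1.0])
v = jnp.array([0.5, 2.0])
print(hvp(loss, theta, v))   # matches A @ v
print(A @ v)
```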

Several practical compromises have been explored: limited-memory quasi-Newton methods (L-BFGS), Hessian-free optimisation (Martens 2010), and Kronecker-factored approximations (K-FAC, Shampoo, SOAP, which we discuss in §10.7). For now, the dominant view is that cheap per-parameter curvature proxies (Adam's diagonal $v_t$, Shampoo's per-axis preconditioners) are the sweet spot of cost and benefit. Anything richer is rarely worth the engineering investment unless you operate at the very largest scales.

A worked example: the Rosenbrock banana

To make the geometry concrete, consider the Rosenbrock function

$$L(x, y) = (1 - x)^2 + 100\, (y - x^2)^2,$$

a classical 2D test problem with a curved valley running through the parabola $y = x^2$ and a unique minimum at $(1, 1)$. The Hessian along the valley floor has eigenvalues that vary by orders of magnitude. At $(0, 0)$, the eigenvalues are approximately $(2, 200)$, giving a condition number of 100. Plain gradient descent oscillates wildly across the valley walls; momentum lets us coast along the valley floor; Nesterov tightens the corner. AdaGrad scales the two coordinates differently and effectively whitens the problem.

Running gradient descent from $(-1, 1)$ with learning rate $0.001$ takes around $20\,000$ iterations to reach $(1, 1)$ to four decimal places. With momentum $\mu = 0.9$, around $5\,000$. With Adam at default settings and $\eta = 0.05$, around $500$. The Rosenbrock function is contrived, but it captures, in two dimensions, the behaviour deep networks display in millions of dimensions: ill-conditioned curvature, narrow valleys, and the dramatic improvements that better optimisers offer.
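A sketch for reproducing these runs (exact iteration counts depend on the tolerance and stopping rule, so treat the numbers above as indicative):

```python
import jax
import jax.numpy as jnp

def rosenbrock(p):
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

grad = jax.jit(jax.grad(rosenbrock))

def run(eta, mu=0.0, max_steps=50_000, tol=1e-4):
    """Gradient descent with optional heavy-ball momentum mu, from (-1, 1)."""
    theta = jnp.array([-1.0, 1.0])
    velocity = jnp.zeros_like(theta)
    for t in range(max_steps):
        velocity = mu * velocity - eta * grad(theta)
        theta = theta + velocity
        if float(jnp.max(jnp.abs(theta - 1.0))) < tol:  # within tol of (1, 1)
            return t + 1
    return max_steps

print("plain GD:", run(eta=1e-3))           # on the order of 10^4 steps
print("momentum:", run(eta=1e-3, mu=0.9))   # several times fewer
```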

The role of initialisation

Optimisation theory tells us where we end up given a starting point. It does not tell us how to choose the starting point. Initialisation is therefore as important as the optimiser. A bad initialisation (all zeros, all ones, large random values) fails not because the optimiser is bad but because the loss landscape is locally hostile to first-order methods.

The standard recipes are Xavier/Glorot initialisation (variance $2/(d_{\mathrm{in}} + d_{\mathrm{out}})$, designed to keep activation variances stable through linear+sigmoid layers) and He/Kaiming initialisation (variance $2/d_{\mathrm{in}}$, for ReLU networks). For Transformers, scaled initialisations like Megatron-style "small init" with standard deviation $\sigma = \sqrt{2/(5 d_{\mathrm{model}})}$ for residual-branch projections prevent activation magnitudes from blowing up with depth. The interaction between initialisation and the descent lemma is direct: initialisations that produce extreme activations correspond to large gradients, which violate any reasonable smoothness assumption and force tiny learning rates.
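A sketch of the two classical recipes as Gaussian variants (assuming JAX's random API; the layer widths are illustrative):

```python
import jax
import jax.numpy as jnp

def xavier_init(key, d_in, d_out):
    """Glorot/Xavier: Var = 2 / (d_in + d_out); keeps activation variance
    roughly stable through linear + sigmoid/tanh layers."""
    return jnp.sqrt(2.0 / (d_in + d_out)) * jax.random.normal(key, (d_in, d_out))

def he_init(key, d_in, d_out):
    """He/Kaiming: Var = 2 / d_in; compensates for ReLU zeroing out
    half of the pre-activations on average."""
    return jnp.sqrt(2.0 / d_in) * jax.random.normal(key, (d_in, d_out))

key = jax.random.PRNGKey(0)
W = he_init(key, 1024, 1024)
print(float(W.std()))   # close to sqrt(2/1024) ~= 0.0442
```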
