10.3 Stochastic gradient descent

Computing $\nabla L(\theta)$ exactly costs $N$ forward-backward passes. For $N = 10^9$ examples, this is prohibitively expensive at every update. Stochastic gradient descent (SGD) replaces the exact gradient with a cheap, unbiased estimate.

The Robbins–Monro setup

Robbins and Monro (1951) studied the more general problem of finding a root of the equation $\mathbb{E}[g(\theta, \xi)] = 0$ when only noisy samples $g(\theta, \xi)$ are available. They proved that the iteration

$$\theta_{t+1} = \theta_t - \eta_t\, g(\theta_t, \xi_t)$$

converges almost surely to a root provided

$$\sum_t \eta_t = \infty, \qquad \sum_t \eta_t^2 < \infty.$$

The first condition prevents premature stopping (you must accumulate enough learning rate to reach the root); the second damps the noise. Schedules satisfying both include $\eta_t = c/t^\alpha$ for $\alpha \in (1/2, 1]$.
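
As a concrete illustration, here is a minimal NumPy sketch of the Robbins–Monro iteration on a toy root-finding problem (the target value, noise level, and schedule constants are illustrative choices, not from the text): we seek the root of $\mathbb{E}[\theta - \xi] = 0$ with $\xi \sim \mathcal{N}(\theta^*, 1)$, which is $\theta^*$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 3.0        # unknown root: E[g(theta, xi)] = theta - theta_star
theta = 0.0             # initial guess
c, alpha = 1.0, 1.0     # schedule eta_t = c / t**alpha with alpha in (1/2, 1]

for t in range(1, 10_001):
    xi = rng.normal(theta_star, 1.0)   # noisy sample
    g = theta - xi                     # noisy evaluation of E[g] = theta - theta_star
    theta -= (c / t**alpha) * g        # Robbins-Monro update

print(theta)                           # close to theta_star = 3.0
```

With $\eta_t = 1/t$ this iteration is exactly the running average of the samples $\xi_1, \ldots, \xi_t$, which makes the almost-sure convergence plausible.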

In the deep learning case, $g(\theta, \xi)$ is the mini-batch gradient and $\xi$ indexes the batch.

Mini-batch gradients

Let $\mathcal{B}_t$ be a uniformly random subset of $\{1, \ldots, N\}$ of size $B$. The mini-batch gradient

$$\hat{g}_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla \ell_i(\theta_t)$$

has expectation $\mathbb{E}[\hat{g}_t] = \nabla L(\theta_t)$ (it is unbiased). Its variance, assuming sampling without replacement and ignoring the finite-population correction, is

$$\operatorname{Var}(\hat{g}_t) = \frac{\sigma^2}{B}, \qquad \sigma^2 := \frac{1}{N} \sum_{i=1}^N \|\nabla \ell_i - \nabla L\|^2.$$

The variance scales as $1/B$: doubling the batch size halves the variance, but reduces the standard deviation only by a factor of $\sqrt{2}$ (halving the standard deviation requires quadrupling $B$). This is the key trade-off: larger batches give cleaner gradients but cost more compute per step.
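
The $1/B$ scaling is easy to check numerically. The sketch below uses randomly generated vectors as stand-ins for the per-example gradients (the sizes and constants are arbitrary) and compares the empirical variance of the mini-batch estimate against $\sigma^2/B$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
grads = rng.normal(size=(N, d))                  # stand-ins for grad l_i(theta)
full_grad = grads.mean(axis=0)                   # grad L(theta)
sigma2 = ((grads - full_grad) ** 2).sum(axis=1).mean()

for B in (1, 10, 100):
    # empirical variance of the mini-batch gradient over many random batches
    draws = np.array([grads[rng.choice(N, size=B, replace=False)].mean(axis=0)
                      for _ in range(2_000)])
    emp_var = ((draws - full_grad) ** 2).sum(axis=1).mean()
    print(f"B={B:4d}  empirical={emp_var:.4f}  sigma^2/B={sigma2 / B:.4f}")
```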

The signal-to-noise ratio

The gradient signal-to-noise ratio is $\mathrm{SNR} = \|\nabla L\|^2 / \operatorname{Var}(\hat{g})$. Early in training, $\|\nabla L\|$ is large and the SNR is high; even tiny batches are informative. Late in training, $\|\nabla L\|$ is small and the SNR drops; the noise can dominate the signal, which is when we typically reduce the learning rate or increase the batch size.
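
The SNR can be estimated directly from a set of per-example gradients; the helper below (its name is purely illustrative) combines the two formulas above:

```python
import numpy as np

def gradient_snr(grads: np.ndarray, B: int) -> float:
    """SNR = ||grad L||^2 / (sigma^2 / B), from per-example gradients (rows of grads)."""
    full_grad = grads.mean(axis=0)
    sigma2 = ((grads - full_grad) ** 2).sum(axis=1).mean()
    return float((full_grad ** 2).sum() / (sigma2 / B))
```

Monitoring this quantity during training is one way to decide when to decay $\eta$ or grow $B$.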

Why noise can help

If we wanted to minimise variance, we would always use full-batch gradient descent. So why is mini-batch SGD the workhorse of deep learning? Three reasons:

  1. Computational efficiency. A full-batch step must touch all $N$ examples before the parameters move at all; SGD makes progress after only $B$ examples, and processing $B$ at a time on a GPU is far cheaper per example than processing one at a time. SGD spends compute where it counts.
  2. Saddle-point escape. As discussed in §10.1, noise breaks ties at saddle points. A purely deterministic method can stall on a saddle; a noisy one cannot.
  3. Implicit regularisation. Noise pushes the iterate towards flatter regions of the loss surface, which generalise better. We will make this precise in §10.14.

Key Principle

Mini-batch SGD update. For batch size $B$ and learning rate $\eta_t$,

$$\theta_{t+1} = \theta_t - \eta_t\, \hat{g}_t, \qquad \hat{g}_t = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla \ell_i(\theta_t).$$

The estimate is unbiased; its variance is $O(\sigma^2/B)$ where $\sigma^2$ is the per-example gradient variance.
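
Putting the pieces together, here is a minimal end-to-end sketch of this update on a synthetic least-squares problem (the data-generating process, batch size, and learning-rate schedule are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: l_i(theta) = 0.5 * (x_i @ theta - y_i)^2
N, d = 5_000, 10
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
B = 32
for t in range(1, 2_001):
    eta = 0.5 / t**0.6                              # decaying schedule, alpha in (1/2, 1]
    batch = rng.choice(N, size=B, replace=False)    # uniformly random mini-batch B_t
    resid = X[batch] @ theta - y[batch]
    g_hat = X[batch].T @ resid / B                  # (1/B) * sum_i grad l_i(theta)
    theta -= eta * g_hat                            # SGD update

print(np.linalg.norm(theta - theta_true))           # small: theta is close to theta_true
```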
