10.9 Batch size and the linear scaling rule

The linear scaling rule (Goyal et al. 2017):

When the batch size is multiplied by $k$, multiply the learning rate by $k$.

The idea is that $k$ steps of SGD with batch size $B$ and learning rate $\eta$ are approximately equivalent to one step with batch size $kB$ and learning rate $k\eta$. Why? Consider $k$ small-batch updates:

$$\theta_{t+k} = \theta_t - \eta \sum_{j=0}^{k-1} g_{t+j}.$$

Treating $g_{t+j} \approx g_t$ for small $\eta$ (the parameters barely move), this is

$$\theta_{t+k} \approx \theta_t - k\eta\, g_t.$$

A single step at the larger batch size matches if we use learning rate $k\eta$.

The argument is heuristic (the gradients are not really constant across $k$ steps), but it works extremely well in practice, up to a point.
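The approximation above can be checked numerically. The sketch below, on an illustrative 1-D quadratic $L(w) = \tfrac{1}{2}w^2$ with artificial gradient noise (all constants are assumptions, not from any paper), compares $k$ small-batch steps at rate $\eta$ with one averaged step at rate $k\eta$:

```python
import numpy as np

# Toy check of the linear scaling heuristic on L(w) = 0.5 * w^2,
# whose true gradient at w is simply w. Constants are illustrative.
rng = np.random.default_rng(0)
eta, k, noise = 0.01, 8, 0.1

def grad(w):
    # Stochastic gradient: true gradient w plus mini-batch noise.
    return w + noise * rng.standard_normal()

# k small-batch steps with learning rate eta.
w_small = 1.0
for _ in range(k):
    w_small -= eta * grad(w_small)

# One "large-batch" step with learning rate k * eta; averaging k gradient
# samples at the same point mimics a k-times-larger batch.
w_large = 1.0
w_large -= k * eta * np.mean([grad(w_large) for _ in range(k)])

print(w_small, w_large)  # both land near (1 - eta)^k
```

Both trajectories end close to $(1-\eta)^k \approx 0.923$; the gap between them is exactly the error of treating $g_{t+j} \approx g_t$.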

Critical batch size

McCandlish et al. (2018) showed that the linear scaling rule has a regime of validity: it holds up to a critical batch size $B^\star$, beyond which doubling the batch size no longer halves the time to a target loss. The critical batch size is approximately

$$B^\star \approx \frac{\kappa\,\sigma^2}{\|\nabla L\|^2},$$

a noise-to-signal ratio scaled by curvature. Above $B^\star$, gradient noise is no longer the bottleneck; the deterministic optimisation dynamics are. Increasing $B$ further improves gradient quality but does not let you take bigger steps.

For a ResNet trained on ImageNet, $B^\star \approx 8000$; beyond that point, linear scaling breaks down.
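McCandlish et al. also give a curvature-free "simple" noise scale, $B_{\text{simple}} = \operatorname{tr}(\Sigma)/\|G\|^2$, which can be estimated from per-example gradients and serves as a rough proxy for $B^\star$. A minimal sketch, using synthetic per-example gradients (the dimensions, noise level, and variable names are all assumptions for illustration):

```python
import numpy as np

# Estimate the "simple" noise scale B_simple = tr(Sigma) / ||G||^2
# from per-example gradients. This version ignores curvature (kappa),
# so it is only a rough proxy for the critical batch size.
rng = np.random.default_rng(0)
n, d = 512, 10
true_grad = rng.standard_normal(d)
# Synthetic per-example gradients: true gradient plus per-example noise.
per_example = true_grad + 0.5 * rng.standard_normal((n, d))

G = per_example.mean(axis=0)                          # full-batch gradient estimate
tr_Sigma = per_example.var(axis=0, ddof=1).sum()      # trace of per-example covariance
B_simple = tr_Sigma / np.dot(G, G)
print(B_simple)
```

In a real training run the per-example gradients would come from the model; frameworks differ in how cheaply these can be extracted, which is why the paper also describes a two-batch-size estimator.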

When the rule breaks

  • Very early in training. Gradients are large and the dynamics are far from linear, hence the need for warmup whenever $B$ is large.
  • Above the critical batch size. Linear scaling overshoots; you need a sub-linear rule (often $\sqrt k$).
  • With Adam-family optimisers. The adaptive denominator changes the picture; LAMB or careful tuning is needed for very large batches.

Practical advice: scale the learning rate linearly up to a batch size you have verified empirically, and use longer warmup as $B$ grows.
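The combination of linear scaling and warmup can be packaged as a small schedule function. This is a sketch, not any specific paper's recipe; the base learning rate, base batch size, and warmup length below are placeholder values you would tune:

```python
def scaled_lr_schedule(step, base_lr=0.1, base_batch=256, batch=2048,
                       warmup_steps=500):
    """Linear-scaling learning rate with linear warmup.

    All defaults are illustrative, not a published recipe.
    """
    k = batch / base_batch          # linear scaling factor
    target_lr = base_lr * k         # the "k * eta" target rate
    if step < warmup_steps:
        # Ramp up toward the scaled target to avoid instability early in
        # training, when gradients are large and the dynamics nonlinear.
        return target_lr * (step + 1) / warmup_steps
    return target_lr
```

For batches above the critical size, the same structure accommodates a sub-linear rule by replacing `k` with, say, `(batch / base_batch) ** 0.5`.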
