10.6 Adaptive learning rates
Different parameters need different learning rates. Embeddings of common words receive thousands of updates per epoch; embeddings of rare words receive few. A single global learning rate that suits the common-word case leaves rare-word embeddings barely moving; one large enough for rare words makes the frequently updated common-word embeddings overshoot. Adaptive methods give each parameter its own rate.
AdaGrad
Duchi, Hazan and Singer (2011) introduced AdaGrad:
$$G_t = G_{t-1} + g_t \odot g_t$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t,$$
where $\odot$ is element-wise multiplication and the square root and division are also element-wise. $G_t$ is a running sum of squared gradients per parameter. Parameters with consistently large gradients get small effective learning rates; sparse parameters with rare large gradients get aggressive updates when those rare gradients arrive.
AdaGrad has a clean regret bound for online convex optimisation: $O(\sqrt{T})$ regret, with the per-parameter normalisation adapting automatically to the gradient geometry rather than relying on hand-tuned per-dimension rates. For sparse problems (large vocabularies, click-prediction features) AdaGrad was a major advance.
The problem: $G_t$ only grows. Eventually the effective learning rate $\eta/\sqrt{G_t}$ becomes vanishingly small and training stalls. This makes AdaGrad poorly suited to deep networks, where we want to keep training for many epochs.
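A minimal NumPy sketch of one AdaGrad step, following the update above (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """One AdaGrad update on parameter vector theta given gradient g.

    G is the per-parameter running sum of squared gradients. It only grows,
    so the effective rate lr / (sqrt(G) + eps) shrinks monotonically.
    """
    G = G + g * g
    theta = theta - lr / (np.sqrt(G) + eps) * g
    return theta, G

# Toy usage: one densely updated coordinate, one rarely updated coordinate.
theta, G = np.zeros(2), np.zeros(2)
for step in range(100):
    g = np.array([1.0, 1.0 if step == 50 else 0.0])
    theta, G = adagrad_step(theta, g, G)
```

The densely updated coordinate's effective rate decays like $1/\sqrt{t}$, while the rarely updated coordinate still takes a near-full-size step when its gradient finally arrives.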
RMSProp
Hinton's solution (unpublished, taught in Coursera lectures around 2012) is to replace the running sum with an exponentially weighted moving average:
$$s_t = \beta_2\, s_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \odot g_t.$$
With $\beta_2 = 0.999$, $s_t$ tracks a moving average of recent squared gradients with effective horizon $1/(1-\beta_2) = 1000$ steps. Old gradients are forgotten; the effective learning rate stays meaningful indefinitely.
RMSProp is essentially "AdaGrad with a leak" and was the standard adaptive method until Adam appeared.
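The change from AdaGrad is a single line: the running sum becomes a leaky average. A sketch under the same illustrative assumptions as the AdaGrad snippet:

```python
import numpy as np

def rmsprop_step(theta, g, s, lr=1e-3, beta2=0.999, eps=1e-8):
    """One RMSProp update. Unlike AdaGrad's running sum, s is an exponentially
    weighted moving average, so old gradients decay away and the effective
    learning rate never collapses to zero."""
    s = beta2 * s + (1 - beta2) * g * g
    theta = theta - lr / (np.sqrt(s) + eps) * g
    return theta, s
```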
Adam
Kingma and Ba (2015) combined momentum with RMSProp-style adaptive rates. Adam tracks two exponentially weighted moving averages: a first moment (gradient mean) and a second moment (gradient squared, like RMSProp).
$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$$ $$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$$ $$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$ $$\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$$
The defaults are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
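Putting the pieces together, a minimal sketch of one Adam step with the defaults above (illustrative, not a reference implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count, needed for bias correction."""
    m = beta1 * m + (1 - beta1) * g       # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # undo the bias from initialising m at zero
    v_hat = v / (1 - beta2 ** t)          # same for v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```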
Bias correction explained
The bias correction $\hat m_t = m_t/(1 - \beta_1^t)$ is not cosmetic. It corrects a real statistical bias.
Initialise $m_0 = 0$. Then
$$m_t = (1 - \beta_1) \sum_{i=1}^t \beta_1^{t-i} g_i.$$
If we assume $g_t$ is drawn from a stationary distribution with mean $\mathbb{E}[g]$,
$$\mathbb{E}[m_t] = (1 - \beta_1)\, \mathbb{E}[g] \sum_{i=1}^t \beta_1^{t-i} = (1 - \beta_1^t)\, \mathbb{E}[g].$$
So $m_t$ is biased low by a factor $(1 - \beta_1^t)$. At $t = 1$ this factor is $1 - \beta_1 = 0.1$, a tenfold bias. Dividing by $(1 - \beta_1^t)$ undoes the bias and lets the optimiser take meaningful steps from the very first iteration. The same argument applies to $v_t$. As $t \to \infty$, $\beta_1^t \to 0$ and the correction becomes negligible.
Why Adam is popular
- Robust to learning rate. A wide range of $\eta$ values give similar performance.
- Per-parameter rates. Sparse parameters and dense parameters both train.
- Bias correction. The optimiser is well-behaved from step one.
- Default hyperparameters work. $\eta = 10^{-3}$ or $3 \times 10^{-4}$ is a near-universal starting point.
For Transformers, RNNs, and any model with mixed-density gradients, Adam is the default. For pure CNN image classification, well-tuned SGD with momentum sometimes generalises slightly better.
AdamW: decoupled weight decay
L2 regularisation adds a penalty $\tfrac{\lambda}{2}\|\theta\|^2$ to the loss. The gradient of the penalty is $\lambda \theta$. Naive Adam with L2 regularisation looks like
$$g_t \leftarrow g_t + \lambda \theta_t,$$
then applies the Adam update to $g_t$. The problem: the weight-decay term gets divided by $\sqrt{\hat v_t} + \epsilon$ along with everything else. Parameters with large second-moment estimates get less regularisation than those with small ones, the opposite of what we want.
Loshchilov and Hutter (2019) proposed decoupled weight decay:
$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} - \eta \lambda \theta_t.$$
The weight decay is applied directly to the parameter rather than bundled into the gradient, so it is genuinely an $\eta\lambda$-fraction shrinkage of each parameter per step, regardless of the second moment.
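A sketch of the decoupled update in the same illustrative NumPy style as the earlier snippets; contrast the final line with folding $\lambda \theta_t$ into the gradient before the Adam step:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update. The decay term is applied straight to the parameters,
    so it is never divided by sqrt(v_hat) + eps the way the gradient is."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive gradient step
    theta = theta - lr * weight_decay * theta             # decoupled shrinkage
    return theta, m, v
```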
Empirically, AdamW with cosine learning rate decay improved results across the board. It is now the default optimiser for training Transformers and most modern architectures.
Worked example: bias correction matters. Suppose all gradients are exactly equal to $g$. Without bias correction at $t = 1$:
$$m_1 = (1 - 0.9)\, g = 0.1\, g.$$
The first step is ten times smaller than it should be. With correction: $\hat m_1 = m_1 / (1 - 0.9) = g$, the right size. Without bias correction, Adam would effectively need a higher initial learning rate, undoing some of its appeal.
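A quick numerical check of this example, assuming a constant gradient of $1.0$ (hypothetical values, just to illustrate the warm-up):

```python
# Constant gradient: the uncorrected first moment starts at a tenth of the true
# mean and takes roughly 1/(1 - beta1) steps to warm up; the corrected estimate
# equals the true mean from step one.
beta1, g, m = 0.9, 1.0, 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    print(t, round(m, 4), round(m / (1 - beta1 ** t), 4))
# uncorrected m:   0.1, 0.19, 0.271, 0.3439, 0.4095
# corrected m_hat: 1.0 at every step
```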