10.8 Learning rate schedules
The single most important hyperparameter in deep learning is the learning rate, and the second most important is its schedule. Good models can be trained with a poor schedule if the base rate itself is well tuned, but the best models almost always use a non-trivial schedule.
Constant
$\eta_t = \eta_0$. The simplest possible schedule. Convergence theory predicts a noise floor of $O(\eta_0 \sigma^2/B)$, where $\sigma^2$ is the gradient-noise variance and $B$ the batch size: the loss oscillates around a level proportional to $\eta_0$ rather than converging. Use only for pilot runs or when training time is too short to benefit from decay.
Step decay
$\eta_t = \eta_0\, \gamma^{\lfloor t/T_{\mathrm{step}} \rfloor}$. Reduce by a factor $\gamma$ (typically $0.1$) every $T_{\mathrm{step}}$ steps. The classic ResNet recipe used $\eta_0 = 0.1$ and divided by $10$ at epochs $30$, $60$ and $90$ for a $100$-epoch ImageNet run. Step decay is simple, but it introduces the decay factor and the step timing as additional hyperparameters to tune.
Exponential decay
$\eta_t = \eta_0\, \gamma^t$. Smoother than step decay; the rate decreases continuously. Common in older RL pipelines. Because the decay compounds every step, $\gamma$ must sit very close to $1$: even a slightly too-small value collapses the rate long before training ends.
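For concreteness, here is a minimal sketch of both decay rules as pure functions of the step index (the function names and constants are illustrative, not from any particular library):

```python
def step_decay(t, eta0=0.1, gamma=0.1, t_step=30_000):
    """Step decay: eta_t = eta0 * gamma ** floor(t / t_step)."""
    return eta0 * gamma ** (t // t_step)


def exponential_decay(t, eta0=0.1, gamma=0.99995):
    """Exponential decay: eta_t = eta0 * gamma ** t.

    gamma must be very close to 1: even 0.99995 shrinks the rate
    by a factor of ~150 over 100k steps (0.99995 ** 100_000 ~ 0.0067).
    """
    return eta0 * gamma ** t
```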
Cosine annealing
Loshchilov and Hutter (2017):
$$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left[1 + \cos\!\left(\pi\, \frac{t}{T}\right)\right].$$
The rate starts at $\eta_{\max}$, drops slowly initially, faster in the middle, and slowly again at the end. The shape matches an empirical observation: most of the optimisation work happens in the middle phase, while fine-grained refinement matters most near the end. Cosine annealing has displaced step decay as the default modern schedule.
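A direct transcription of the formula (the default values for $\eta_{\max}$ and $\eta_{\min}$ here are illustrative):

```python
import math

def cosine_annealing(t, T, eta_max=3e-4, eta_min=3e-5):
    """Cosine anneal from eta_max at t = 0 down to eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```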
Warm restarts (SGDR)
Cosine annealing with warm restarts periodically resets $\eta_t$ back to $\eta_{\max}$. Each restart kicks the optimiser out of any sharp local basin it has settled into. Schedules of the form $T_i = T_0\, \mu^i$ (each cycle longer than the last) are common. Useful for ensembling: the iterates saved at the end of each cycle form a diverse ensemble (the basis of snapshot ensembles).
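A sketch of the restart logic under geometrically growing cycles $T_i = T_0\,\mu^i$ (names and defaults are illustrative):

```python
import math

def sgdr(t, T0=1_000, mu=2.0, eta_max=3e-4, eta_min=3e-5):
    """Cosine annealing with warm restarts; cycle i lasts T0 * mu**i steps."""
    T_i = T0
    while t >= T_i:          # advance to the cycle containing step t
        t -= T_i
        T_i = int(T_i * mu)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))
```

PyTorch provides a built-in equivalent, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, whose `T_0` and `T_mult` arguments play the roles of $T_0$ and $\mu$.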
Inverse-square-root (Transformer schedule)
Vaswani et al. (2017) introduced this schedule for the original Transformer:
$$\eta_t = d_{\mathrm{model}}^{-1/2}\, \min(t^{-1/2},\, t \cdot t_{\mathrm{warmup}}^{-3/2}).$$
The schedule is linear warmup over $t_{\mathrm{warmup}}$ steps, then $1/\sqrt{t}$ decay. The $1/\sqrt{t}$ rate matches the classical decaying step size from stochastic-approximation analyses of SGD. The $d_{\mathrm{model}}^{-1/2}$ factor makes the schedule width-aware: models with a larger hidden dimension get a proportionally smaller learning rate.
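Transcribed directly, using the paper's base-model values $d_{\mathrm{model}} = 512$ and $t_{\mathrm{warmup}} = 4000$ as defaults:

```python
def transformer_lr(t, d_model=512, t_warmup=4_000):
    """Vaswani et al. (2017): linear warmup, then 1/sqrt(t) decay."""
    t = max(t, 1)  # the formula is undefined at t = 0
    return d_model ** -0.5 * min(t ** -0.5, t * t_warmup ** -1.5)
```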
Linear warmup
For Adam-family optimisers, the second-moment estimate $\hat v_t$ is unreliable in the very first steps. Without warmup, the bias-corrected estimate can produce huge updates that destabilise training. Linear warmup over the first $t_{\mathrm{warmup}}$ steps (typically $500$–$5000$, or about $1\%$ of total) ramps the learning rate from $0$ (or a small floor) up to $\eta_{\max}$:
$$\eta_t = \eta_{\max} \cdot \frac{t}{t_{\mathrm{warmup}}}, \qquad t \le t_{\mathrm{warmup}}.$$
For Transformers, warmup is essential. Skipping it almost always causes loss spikes or divergence early in training.
One-cycle policy
Smith (2017) proposed a single triangular cycle: ramp the rate up over the first half of training, then back down over the second half. The peak rate is set high enough to act as a regulariser in its own right; the final low rate gives precise convergence. One-cycle often reaches better accuracy in fewer epochs than cosine annealing on image classification.
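A minimal sketch of the triangular policy (the `div_factor` ramp endpoint is illustrative; library implementations such as PyTorch's `OneCycleLR` add refinements like a cosine ramp and momentum cycling):

```python
def one_cycle(t, T, eta_max=1e-3, div_factor=25.0):
    """Single triangular cycle: linear ramp up over T/2 steps, then down."""
    eta_start = eta_max / div_factor
    half = T / 2
    if t <= half:
        return eta_start + (eta_max - eta_start) * (t / half)
    return eta_max - (eta_max - eta_start) * ((t - half) / half)
```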
Practical recipe
For modern Transformer training, the de facto standard is:
- Linear warmup over $\sim 1\%$ of total steps.
- Cosine decay from $\eta_{\max}$ to $\eta_{\min} \approx 0.1\,\eta_{\max}$ over the remaining $99\%$.
- $\eta_{\max} = 3 \times 10^{-4}$ for AdamW, scaled by the linear scaling rule for batch size.
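Put together as a single function (a sketch with the constants above as defaults; in PyTorch this would typically be wrapped in a `LambdaLR` that returns the multiplier $\eta_t / \eta_{\max}$):

```python
import math

def warmup_cosine(t, T, eta_max=3e-4, warmup_frac=0.01, eta_min_frac=0.1):
    """Linear warmup over the first warmup_frac * T steps, then cosine
    decay from eta_max down to eta_min_frac * eta_max at step T."""
    t_warmup = max(1, int(warmup_frac * T))
    if t < t_warmup:
        return eta_max * t / t_warmup
    eta_min = eta_min_frac * eta_max
    progress = (t - t_warmup) / max(1, T - t_warmup)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```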