Warmup, then cosine decay: the learning rate's path through training matters as much as its peak.
From Chapter 10: Training & Optimisation
Glossary: learning rate, warmup, cosine decay, training schedule
Transcript
The learning rate is often called the most important hyperparameter in deep learning, and its value at every point in training matters, not just its peak.
The simplest schedule is constant. The learning rate is set once and never changes.
A step decay drops the learning rate by a constant factor every few epochs; the characteristic cliffs in the loss curve line up with those drops.
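A minimal sketch of step decay in Python (the base rate of 0.1, the drop factor of 0.1, and the 30-epoch interval are illustrative assumptions, not values from the transcript):

```python
def step_decay_lr(epoch, base_lr=0.1, drop_factor=0.1, drop_every=30):
    # Multiply the learning rate by drop_factor every drop_every epochs,
    # producing the staircase shape that matches the cliffs in the loss curve.
    return base_lr * drop_factor ** (epoch // drop_every)
```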
Warmup is now standard for large models. The learning rate starts near zero and rises linearly for the first few thousand steps. This gives the model a chance to settle before bigger updates begin.
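A sketch of linear warmup along the same lines (the peak rate of 3e-4 and the 2,000-step warmup length are assumed for illustration; the transcript only says "the first few thousand steps"):

```python
def linear_warmup_lr(step, peak_lr=3e-4, warmup_steps=2000):
    # Rise linearly from near zero to peak_lr over the first warmup_steps,
    # then hold at peak_lr.
    return peak_lr * min(1.0, (step + 1) / warmup_steps)
```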
Cosine decay smooths everything out. After warmup, the learning rate falls along a half-cosine, ending near zero by the final step.
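The half-cosine shape could look like this (peak rate and total step count are again assumed; min_lr=0.0 reflects the schedule ending near zero):

```python
import math

def cosine_decay_lr(step, peak_lr=3e-4, total_steps=100_000, min_lr=0.0):
    # Follow a half-cosine from peak_lr down to min_lr over total_steps.
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```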
Modern training recipes combine warmup with cosine decay: the warmup prevents early divergence, and the cosine decay ensures the final updates are small enough to settle into a good minimum.
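Putting the two phases together, a combined schedule might be sketched as follows (all hyperparameter values here are illustrative assumptions, not values from the transcript):

```python
import math

def warmup_cosine_lr(step, peak_lr=3e-4, warmup_steps=2000,
                     total_steps=100_000, min_lr=0.0):
    # Phase 1: linear warmup from near zero to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Phase 2: half-cosine decay from peak_lr down to min_lr
    # over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In a training loop, a function like this would be called once per step to set the optimizer's learning rate before each update.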
The right schedule can be the difference between a model that converges and one that does not.