The Learning Rate, denoted $\eta$, is the step size used in gradient-based optimisation: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L$. It is arguably the single most important hyperparameter in deep learning. Set it too large and updates overshoot minima, causing oscillation or divergence. Set it too small and training proceeds agonisingly slowly, potentially getting stuck in suboptimal regions.
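The effect of the step size is easy to see on a toy loss. The sketch below (an illustrative example, not from the source) applies the update rule $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L$ to $L(w) = w^2$, where a moderate $\eta$ converges, a tiny $\eta$ crawls, and $\eta > 1$ diverges:

```python
def gradient_descent(eta, steps=50, w0=10.0):
    """Minimise L(w) = w^2 with a fixed learning rate eta."""
    w = w0
    for _ in range(steps):
        grad = 2 * w        # dL/dw for L(w) = w^2
        w = w - eta * grad  # the update rule: w <- w - eta * grad
    return w

print(gradient_descent(0.1))    # converges close to the minimum at 0
print(gradient_descent(0.001))  # barely moves from w0: too slow
print(gradient_descent(1.1))    # magnitude grows each step: divergence
```

For this quadratic each step multiplies $w$ by $(1 - 2\eta)$, so divergence begins exactly when $|1 - 2\eta| > 1$, i.e. $\eta > 1$; real losses have no such closed form, which is why the learning rate must be tuned empirically.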
The optimal learning rate depends on architecture, optimiser, batch size, and data. Common starting points are $10^{-3}$ for Adam and $10^{-1}$ for SGD with momentum on computer vision tasks. A coarse learning rate range test—training briefly with exponentially increasing learning rates and observing where loss begins to explode—helps identify a reasonable upper bound.
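A range test can be sketched in a few lines. The helper below (a hypothetical implementation; the function name and signature are assumptions, not a library API) grows the learning rate exponentially each step on a toy quadratic loss and records the loss curve, whose blow-up point marks the upper bound:

```python
import numpy as np

def lr_range_test(loss_and_grad, w0, lr_min=1e-6, lr_max=10.0, steps=100):
    """Train briefly while the LR grows exponentially; record (lr, loss)."""
    w = np.array(w0, dtype=float)
    ratio = (lr_max / lr_min) ** (1.0 / (steps - 1))
    history, lr = [], lr_min
    for _ in range(steps):
        loss, grad = loss_and_grad(w)
        history.append((lr, loss))
        w = w - lr * grad
        lr *= ratio
    return history

# Toy quadratic "model": L(w) = ||w||^2, gradient 2w.
f = lambda w: (float(w @ w), 2 * w)
hist = lr_range_test(f, w0=[5.0, -3.0])
# Choose an upper bound just below the LR where loss starts rising sharply.
```

In practice one plots loss against (log) learning rate from `history`; on a real model a smoothed loss is usually needed because minibatch noise obscures the inflection point.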
Learning Rate Schedules adjust $\eta$ over the course of training. Step decay reduces the learning rate by a fixed factor at predetermined epochs. Exponential decay multiplies by a constant less than 1 each epoch. Cosine annealing follows a cosine curve from maximum to minimum and has become the default for transformers. Linear warmup—linearly increasing $\eta$ for the first few hundred or thousand steps—is almost always combined with another schedule, particularly for Adam, where early-iteration second-moment estimates are unreliable. Leslie Smith's one-cycle policy ramps the learning rate up and then back down over a single cycle, often enabling faster training. Properly scheduling the learning rate is one of the most impactful things a practitioner can do to improve model quality.
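The common warmup-plus-cosine combination can be written as a single function of the step count. The sketch below is a minimal illustration; the function name and default values are assumptions, not a standard API:

```python
import math

def lr_at(step, total_steps, warmup_steps=500, peak_lr=1e-3, min_lr=1e-5):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine annealing over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At `step = warmup_steps - 1` this returns exactly `peak_lr`, and as `step` approaches `total_steps` it decays smoothly towards `min_lr`; step decay or exponential decay would replace only the post-warmup branch.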
Related terms: Gradient Descent, Stochastic Gradient Descent, Adam, Hyperparameter Tuning
Discussed in:
- Chapter 10: Training & Optimisation — Learning Rate Schedules
Also defined in: Textbook of AI