The Learning Rate, denoted $\eta$, is the step size used in gradient-based optimisation: $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L$. It is arguably the single most important hyperparameter in deep learning. Set it too large and updates overshoot minima, causing oscillation or divergence. Set it too small and training proceeds agonisingly slowly, potentially getting stuck in suboptimal regions.
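The effect of the step size is easy to see on a toy loss. The sketch below (an illustrative example, not from the source) applies the update rule $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L$ to $L(w) = w^2$, where a moderate $\eta$ converges, a tiny $\eta$ crawls, and $\eta > 1$ diverges:

```python
def gradient_descent(eta, steps=50, w0=10.0):
    """Minimise L(w) = w^2 with a fixed learning rate eta."""
    w = w0
    for _ in range(steps):
        grad = 2 * w        # dL/dw for L(w) = w^2
        w = w - eta * grad  # the update rule: w <- w - eta * grad
    return w

print(gradient_descent(0.1))    # converges close to the minimum at 0
print(gradient_descent(0.001))  # barely moves from w0: too slow
print(gradient_descent(1.1))    # magnitude grows each step: divergence
```

For this quadratic each step multiplies $w$ by $(1 - 2\eta)$, so divergence begins exactly when $|1 - 2\eta| > 1$, i.e. $\eta > 1$; real losses have no such closed form, which is why the learning rate must be tuned empirically.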
The optimal learning rate depends on architecture, optimiser, batch size, and data. Common starting points are $10^{-3}$ for Adam and $10^{-1}$ for SGD with momentum on computer vision tasks. A coarse learning rate range test—training briefly with exponentially increasing learning rates and observing where loss begins to explode—helps identify a reasonable upper bound.
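A range test can be sketched in a few lines. The helper below (a hypothetical implementation; the function name and signature are assumptions, not a library API) grows the learning rate exponentially each step on a toy quadratic loss and records the loss curve, whose blow-up point marks the upper bound:

```python
import numpy as np

def lr_range_test(loss_and_grad, w0, lr_min=1e-6, lr_max=10.0, steps=100):
    """Train briefly while the LR grows exponentially; record (lr, loss)."""
    w = np.array(w0, dtype=float)
    ratio = (lr_max / lr_min) ** (1.0 / (steps - 1))
    history, lr = [], lr_min
    for _ in range(steps):
        loss, grad = loss_and_grad(w)
        history.append((lr, loss))
        w = w - lr * grad
        lr *= ratio
    return history

# Toy quadratic "model": L(w) = ||w||^2, gradient 2w.
f = lambda w: (float(w @ w), 2 * w)
hist = lr_range_test(f, w0=[5.0, -3.0])
# Choose an upper bound just below the LR where loss starts rising sharply.
```

In practice one plots loss against (log) learning rate from `history`; on a real model a smoothed loss is usually needed because minibatch noise obscures the inflection point.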
Learning Rate Schedules adjust $\eta$ over the course of training. Step decay reduces the learning rate by a fixed factor at predetermined epochs. Exponential decay multiplies by a constant less than 1 each epoch. Cosine annealing follows a cosine curve from maximum to minimum and has become the default for transformers. Linear warmup—linearly increasing $\eta$ for the first few hundred or thousand steps—is almost always combined with another schedule, particularly for Adam, where early-iteration second-moment estimates are unreliable. Leslie Smith's one-cycle policy ramps the learning rate up and then back down over a single cycle, often enabling faster training. Properly scheduling the learning rate is one of the most impactful things a practitioner can do to improve model quality.
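The common warmup-plus-cosine combination can be written as a single function of the step count. The sketch below is a minimal illustration; the function name and default values are assumptions, not a standard API:

```python
import math

def lr_at(step, total_steps, warmup_steps=500, peak_lr=1e-3, min_lr=1e-5):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Linear warmup: ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine annealing over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At `step = warmup_steps - 1` this returns exactly `peak_lr`, and as `step` approaches `total_steps` it decays smoothly towards `min_lr`; step decay or exponential decay would replace only the post-warmup branch.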
Related terms: Gradient Descent, Stochastic Gradient Descent, Adam, Hyperparameter Tuning
Discussed in:
- Chapter 10: Training & Optimisation — Learning Rate Schedules
Also defined in: Textbook of AI