Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2015, is among the most widely used optimisers in deep learning. It combines momentum (an exponential moving average of past gradients) with RMSProp-style per-parameter learning rates (scaling by an estimate of the uncentred second moment of the gradient) in a single algorithm that is robust, fast, and requires relatively little tuning.
Adam maintains two exponentially weighted moving averages: the first moment $\mathbf{m}$ (gradient mean, like momentum) and the second moment $\mathbf{v}$ (uncentred gradient variance, like RMSProp). Both are initialised at zero and biased toward zero in early iterations, so Adam applies bias correction: $\hat{\mathbf{m}} = \mathbf{m}/(1 - \beta_1^t)$, $\hat{\mathbf{v}} = \mathbf{v}/(1 - \beta_2^t)$. The update rule is $\mathbf{w} \leftarrow \mathbf{w} - \eta \hat{\mathbf{m}} / (\sqrt{\hat{\mathbf{v}}} + \epsilon)$. Default hyperparameters are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
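The update above can be sketched as a single step function. This is a minimal illustration of the formulas, not a production implementation; the scalar-parameter interface and function name are chosen here for clarity.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w, given gradient grad at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimising f(w) = w^2 (gradient 2w) from w = 1.0:
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 301):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```

Note that on the very first step the bias-corrected update is approximately $\eta \cdot \mathrm{sign}(g)$ in magnitude, since $\hat{m}/\sqrt{\hat{v}} = g/|g|$ when only one gradient has been seen; this is why bias correction matters early in training.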
Adam has become the default optimiser for most deep learning projects, valued for its robustness across architectures. However, Wilson et al. (2017) showed it can generalise worse than well-tuned SGD with momentum on certain tasks. AdamW, proposed by Loshchilov and Hutter (2019), decouples weight decay from the adaptive gradient scaling and has become the de facto standard for training transformers and large language models. AdamW combined with a cosine learning-rate schedule and warmup is the standard recipe for training modern foundation models.
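The difference AdamW makes can be sketched as a variant of the step above: weight decay is subtracted directly from the parameters rather than added to the gradient, so it is not rescaled by $\sqrt{\hat{\mathbf{v}}}$. This is an illustrative sketch assuming the same scalar-parameter interface as before; the function name and defaults are chosen here, not taken from any library.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: decay acts on w directly, outside the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: plain Adam would instead fold weight_decay * w into grad,
    # letting the 1/sqrt(v_hat) factor shrink the decay for high-variance parameters.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

With a zero gradient, the update reduces to $w \leftarrow w(1 - \eta \lambda)$, i.e. pure multiplicative shrinkage, which is exactly the behaviour that coupling decay into the adaptive gradient would distort.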
Related terms: Stochastic Gradient Descent, Learning Rate
Discussed in:
- Chapter 10: Training & Optimisation — Optimisers (Adam, RMSProp)
Also defined in: Textbook of AI