Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2015, is among the most widely used optimisers in deep learning. It combines momentum (an exponential moving average of past gradients) with RMSProp-style per-parameter learning rates (scaling by an estimate of the uncentred second moment of the gradient) in a single algorithm that is robust, fast, and requires relatively little tuning.
Adam maintains two exponentially weighted moving averages: the first moment $\mathbf{m}$ (gradient mean, like momentum) and the second moment $\mathbf{v}$ (uncentred gradient variance, like RMSProp). Both are initialised at zero and biased toward zero in early iterations, so Adam applies bias correction: $\hat{\mathbf{m}} = \mathbf{m}/(1 - \beta_1^t)$, $\hat{\mathbf{v}} = \mathbf{v}/(1 - \beta_2^t)$. The update rule is $\mathbf{w} \leftarrow \mathbf{w} - \eta \hat{\mathbf{m}} / (\sqrt{\hat{\mathbf{v}}} + \epsilon)$. Default hyperparameters are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
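The update above can be sketched as a single step function. This is a minimal illustration of the formulas, not a production implementation; the scalar-parameter interface and function name are chosen here for clarity.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w, given gradient grad at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimising f(w) = w^2 (gradient 2w) from w = 1.0:
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 301):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```

Note that on the very first step the bias-corrected update is approximately $\eta \cdot \mathrm{sign}(g)$ in magnitude, since $\hat{m}/\sqrt{\hat{v}} = g/|g|$ when only one gradient has been seen; this is why bias correction matters early in training.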
Adam has become the default optimiser for most deep learning projects, valued for its robustness across architectures. However, Wilson et al. (2017) showed it can generalise worse than well-tuned SGD with momentum on certain tasks. AdamW, proposed by Loshchilov and Hutter (2019), decouples weight decay from the adaptive gradient scaling and has become the de facto standard for training transformers and large language models. AdamW combined with a cosine learning-rate schedule and warmup is the standard recipe for training modern foundation models.
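The difference AdamW makes can be sketched as a variant of the step above: weight decay is subtracted directly from the parameters rather than added to the gradient, so it is not rescaled by $\sqrt{\hat{\mathbf{v}}}$. This is an illustrative sketch assuming the same scalar-parameter interface as before; the function name and defaults are chosen here, not taken from any library.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: decay acts on w directly, outside the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: plain Adam would instead fold weight_decay * w into grad,
    # letting the 1/sqrt(v_hat) factor shrink the decay for high-variance parameters.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

With a zero gradient, the update reduces to $w \leftarrow w(1 - \eta \lambda)$, i.e. pure multiplicative shrinkage, which is exactly the behaviour that coupling decay into the adaptive gradient would distort.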
Related terms: Stochastic Gradient Descent, Learning Rate
Discussed in:
- Chapter 10: Training & Optimisation — Optimisers (Adam, RMSProp)
Also defined in: Textbook of AI