10.7 Newer optimisers

A handful of newer optimisers are worth knowing because they appear in modern training reports. None has displaced AdamW as the default, but each has a niche.

LAMB (Layer-wise Adaptive Moments)

You et al. (2020) extended Adam with per-layer learning-rate scaling to enable very large batch sizes. Each layer's Adam update is normalised by its own norm, then rescaled by the norm of that layer's parameters:

$$u_t = \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_t$$ $$\theta_{t+1} = \theta_t - \eta\, \frac{\phi(\|\theta_t\|)}{\|u_t\|}\, u_t,$$

where $\phi$ is a clipping function. LAMB enabled BERT to be trained with batch size 32 768 in 76 minutes.
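
A minimal NumPy sketch of one LAMB step for a single weight tensor, following the equations above (the function name and the clipping bound inside $\phi$ are our illustrative choices; the paper leaves $\phi$ configurable):

```python
import numpy as np

def lamb_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB step for a single parameter tensor (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * g           # first moment
    v = beta2 * v + (1 - beta2) * g * g       # second moment
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    u = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta
    # Trust ratio: rescale the Adam direction by the layer's weight norm.
    w_norm = np.linalg.norm(theta)
    u_norm = np.linalg.norm(u)
    phi = min(w_norm, 10.0)                   # clipping bound is a hypothetical choice
    trust = phi / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return theta - lr * trust * u, m, v
```

The trust ratio is what lets the learning rate survive huge batches: a layer whose Adam update is large relative to its weights is automatically slowed down.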

Lion (EvoLved Sign Momentum)

Chen et al. (2023) discovered Lion via program search over the space of optimiser update rules. The update is

$$c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ $$\theta_{t+1} = \theta_t - \eta\, [\operatorname{sign}(c_t) + \lambda \theta_t]$$ $$m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t.$$

Note the sign function: every parameter receives a sign update of magnitude exactly $\eta$, in the direction of the interpolated momentum. Because no second moment is tracked, Lion uses half the optimiser memory of Adam. On large vision transformers, Lion matches or exceeds AdamW, though it requires smaller learning rates ($\eta \approx 10^{-4}$ rather than $10^{-3}$).
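
As a sketch, here is one Lion step for a single tensor, written directly from the three lines above (the $\beta$ defaults follow the paper; the function itself is illustrative):

```python
import numpy as np

def lion_step(theta, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion step for a single parameter tensor (illustrative sketch)."""
    c = beta1 * m + (1 - beta1) * g                        # update direction
    theta = theta - lr * (np.sign(c) + weight_decay * theta)
    m = beta2 * m + (1 - beta2) * g                        # stored momentum uses beta2
    return theta, m
```

Note that the update direction and the stored momentum use different interpolation weights ($\beta_1$ versus $\beta_2$), one of the non-obvious details the program search turned up.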

Adafactor

Shazeer and Stern (2018) reduced Adam's memory cost. Adam stores $m_t$ and $v_t$, two extra tensors the size of $\theta$; for a 70B-parameter model in FP32, that is 560 GB of optimiser state. Adafactor approximates $v_t$ for matrix-shaped parameters by storing only its row sums and column sums: $O(n + m)$ memory instead of $O(nm)$ for an $n \times m$ matrix. This was crucial for training T5.
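
A sketch of the factored second-moment update for an $n \times m$ weight matrix (names are illustrative; the full matrix is materialised below only for clarity, whereas a real implementation applies the factors row by row so memory stays at $O(n + m)$):

```python
import numpy as np

def adafactor_v_update(r, c, g, beta2=0.999, eps=1e-30):
    """Update Adafactor's factored second-moment estimate (illustrative sketch).

    r has shape (n,), c has shape (m,), g has shape (n, m).
    """
    g2 = g * g + eps
    r = beta2 * r + (1 - beta2) * g2.sum(axis=1)   # moving average of row sums
    c = beta2 * c + (1 - beta2) * g2.sum(axis=0)   # moving average of column sums
    # Rank-one reconstruction: v_hat[i, j] ~ r[i] * c[j] / r.sum()
    v_hat = np.outer(r, c) / r.sum()
    return r, c, v_hat
```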

Shampoo and SOAP

These are second-order-style methods that precondition gradients with a Kronecker-factored preconditioner, approximating full-matrix adaptive preconditioning. Shampoo (Gupta et al. 2018) maintains a separate preconditioner for each axis of a tensor parameter. SOAP (Vyas et al. 2024) is a more efficient variant that runs Adam in a rotated coordinate system defined by Shampoo's preconditioner eigenvectors. On modern Transformer pretraining, Shampoo and SOAP can outperform AdamW per FLOP, but their implementation complexity has so far limited adoption.
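
To make the Kronecker-factored idea concrete, here is an illustrative single Shampoo step for an $n \times m$ matrix parameter (a sketch under simplifying assumptions: real implementations add damping and recompute the expensive matrix roots only every few hundred steps):

```python
import numpy as np

def shampoo_step(theta, g, L, R, lr=1e-3, eps=1e-4):
    """One Shampoo step for a matrix parameter (illustrative sketch)."""
    L = L + g @ g.T                          # left statistics,  shape (n, n)
    R = R + g.T @ g                          # right statistics, shape (m, m)

    def inv_fourth_root(M):
        # M^{-1/4} via eigendecomposition; eps keeps it well conditioned.
        w, Q = np.linalg.eigh(M)
        return Q @ np.diag(np.maximum(w, eps) ** -0.25) @ Q.T

    # Precondition the gradient on both axes: L^{-1/4} G R^{-1/4}.
    return theta - lr * inv_fourth_root(L) @ g @ inv_fourth_root(R), L, R
```

The two eigendecompositions are where the implementation complexity (and most of the overhead) lives, which is the adoption barrier noted above.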

Practical guidance

| Use case | Recommended optimiser |
| --- | --- |
| Default for Transformers | AdamW |
| Pure CNN image classification | SGD + Nesterov momentum |
| Very large batch sizes | LAMB |
| Memory-constrained training | Adafactor or Lion |
| Frontier-scale pretraining | AdamW or Shampoo/SOAP |
