10.7 Newer optimisers

A handful of newer optimisers are worth knowing because they appear in modern training reports. None has displaced AdamW as the default, but each has a niche.

LAMB (Layer-wise Adaptive Moments)

You et al. (2020) extended Adam with per-layer learning-rate scaling to enable very large batch sizes. Each layer's Adam update is normalised by its own norm, then rescaled by the norm of that layer's parameters:

$$u_t = \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_t$$ $$\theta_{t+1} = \theta_t - \eta\, \frac{\phi(\|\theta_t\|)}{\|u_t\|}\, u_t,$$

where $\phi$ is a clipping function. LAMB enabled BERT to be trained with batch size 32 768 in 76 minutes.
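
A minimal NumPy sketch of one LAMB step for a single weight tensor, following the equations above (the function name and the clipping bound inside $\phi$ are our illustrative choices; the paper leaves $\phi$ configurable):

```python
import numpy as np

def lamb_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB step for a single parameter tensor (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * g           # first moment
    v = beta2 * v + (1 - beta2) * g * g       # second moment
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    u = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta
    # Trust ratio: rescale the Adam direction by the layer's weight norm.
    w_norm = np.linalg.norm(theta)
    u_norm = np.linalg.norm(u)
    phi = min(w_norm, 10.0)                   # clipping bound is a hypothetical choice
    trust = phi / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return theta - lr * trust * u, m, v
```

The trust ratio is what lets the learning rate survive huge batches: a layer whose Adam update is large relative to its weights is automatically slowed down.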

Lion (EvoLved Sign Momentum)

Chen et al. (2023) discovered Lion via program search over the space of optimiser update rules. The update is

$$c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ $$\theta_{t+1} = \theta_t - \eta\, [\operatorname{sign}(c_t) + \lambda \theta_t]$$ $$m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t.$$

Note the sign function: every parameter receives a sign update of magnitude exactly $\eta$, in the direction of the interpolated momentum. Because no second moment is tracked, Lion uses half the optimiser memory of Adam. On large vision transformers, Lion matches or exceeds AdamW, though it requires smaller learning rates ($\eta \approx 10^{-4}$ rather than $10^{-3}$).
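
As a sketch, here is one Lion step for a single tensor, written directly from the three lines above (the $\beta$ defaults follow the paper; the function itself is illustrative):

```python
import numpy as np

def lion_step(theta, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion step for a single parameter tensor (illustrative sketch)."""
    c = beta1 * m + (1 - beta1) * g                        # update direction
    theta = theta - lr * (np.sign(c) + weight_decay * theta)
    m = beta2 * m + (1 - beta2) * g                        # stored momentum uses beta2
    return theta, m
```

Note that the update direction and the stored momentum use different interpolation weights ($\beta_1$ versus $\beta_2$), one of the non-obvious details the program search turned up.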

Adafactor

Shazeer and Stern (2018) reduced Adam's memory cost. Adam stores $m_t$ and $v_t$, two extra tensors the size of $\theta$; for a 70B-parameter model in FP32, that is 560 GB of optimiser state. Adafactor approximates $v_t$ for matrix-shaped parameters by storing only its row sums and column sums: $O(n + m)$ memory instead of $O(nm)$ for an $n \times m$ matrix. This was crucial for training T5.
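
A sketch of the factored second-moment update for an $n \times m$ weight matrix (names are illustrative; the full matrix is materialised below only for clarity, whereas a real implementation applies the factors row by row so memory stays at $O(n + m)$):

```python
import numpy as np

def adafactor_v_update(r, c, g, beta2=0.999, eps=1e-30):
    """Update Adafactor's factored second-moment estimate (illustrative sketch).

    r has shape (n,), c has shape (m,), g has shape (n, m).
    """
    g2 = g * g + eps
    r = beta2 * r + (1 - beta2) * g2.sum(axis=1)   # moving average of row sums
    c = beta2 * c + (1 - beta2) * g2.sum(axis=0)   # moving average of column sums
    # Rank-one reconstruction: v_hat[i, j] ~ r[i] * c[j] / r.sum()
    v_hat = np.outer(r, c) / r.sum()
    return r, c, v_hat
```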

Shampoo and SOAP

These are second-order-style methods that precondition gradients with a Kronecker-factored preconditioner, approximating full-matrix adaptive preconditioning. Shampoo (Gupta et al. 2018) maintains a separate preconditioner for each axis of a tensor parameter. SOAP (Vyas et al. 2024) is a more efficient variant that runs Adam in a rotated coordinate system defined by Shampoo's preconditioner eigenvectors. On modern Transformer pretraining, Shampoo and SOAP can outperform AdamW per FLOP, but their implementation complexity has so far limited adoption.
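
To make the Kronecker-factored idea concrete, here is an illustrative single Shampoo step for an $n \times m$ matrix parameter (a sketch under simplifying assumptions: real implementations add damping and recompute the expensive matrix roots only every few hundred steps):

```python
import numpy as np

def shampoo_step(theta, g, L, R, lr=1e-3, eps=1e-4):
    """One Shampoo step for a matrix parameter (illustrative sketch)."""
    L = L + g @ g.T                          # left statistics,  shape (n, n)
    R = R + g.T @ g                          # right statistics, shape (m, m)

    def inv_fourth_root(M):
        # M^{-1/4} via eigendecomposition; eps keeps it well conditioned.
        w, Q = np.linalg.eigh(M)
        return Q @ np.diag(np.maximum(w, eps) ** -0.25) @ Q.T

    # Precondition the gradient on both axes: L^{-1/4} G R^{-1/4}.
    return theta - lr * inv_fourth_root(L) @ g @ inv_fourth_root(R), L, R
```

The two eigendecompositions are where the implementation complexity (and most of the overhead) lives, which is the adoption barrier noted above.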

Practical guidance

| Use case | Recommended optimiser |
| --- | --- |
| Default for Transformers | AdamW |
| Pure CNN image classification | SGD + Nesterov momentum |
| Very large batch sizes | LAMB |
| Memory-constrained training | Adafactor or Lion |
| Frontier-scale pretraining | AdamW or Shampoo/SOAP |
