Mixed precision training uses 16-bit floating-point arithmetic for most of the forward and backward passes while keeping critical state (master weights, optimiser moments, certain reductions) in 32-bit. The motivation is hardware: modern accelerators deliver $2$–$4\times$ the matmul throughput on FP16/BF16 inputs compared to FP32, tensor cores on NVIDIA GPUs are optimised for 16-bit inputs with 32-bit accumulation, and halving the bytes per value roughly doubles effective memory bandwidth. A correctly configured mixed-precision run reaches the same final loss as FP32 in roughly half the wall-clock time and with roughly half the activation memory.
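As a minimal sketch of that split in PyTorch (the CPU autocast backend with BF16 is used here so the snippet runs anywhere; on a GPU the same pattern uses `device_type="cuda"`), the layer and shapes are placeholders; the point is that the parameters stay in FP32 while the matmul inside the autocast region runs in BF16:

```python
import torch

lin = torch.nn.Linear(256, 256)      # parameters are created and kept in FP32
x = torch.randn(8, 256)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = lin(x)                        # the matmul itself executes in BF16

print(lin.weight.dtype, y.dtype)      # torch.float32 torch.bfloat16
```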
The three relevant reduced-precision formats differ in how they trade range against precision (the sketch after the list shows how to query these parameters in code):
- FP16 (IEEE half-precision): 1 sign, 5 exponent, 10 mantissa bits. Range $\approx \pm 6.55 \times 10^4$, machine epsilon $\approx 9.8 \times 10^{-4}$. High precision, narrow range, gradients of $10^{-8}$ flush to zero.
- BF16 (bfloat16): 1 sign, 8 exponent, 7 mantissa bits. Range matches FP32 ($\pm 3.4 \times 10^{38}$), epsilon $\approx 7.8 \times 10^{-3}$. Wide range, low precision, but no underflow on gradients.
- FP8: 1 sign, plus either 5+2 (E5M2) or 4+3 (E4M3) exponent/mantissa bits. Two formats are needed because forward activations and backward gradients have very different distributions.
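The range and epsilon figures above can be read straight from `torch.finfo`; a quick check (the commented output is what current PyTorch reports):

```python
import torch

for dt in (torch.float16, torch.bfloat16):
    fi = torch.finfo(dt)
    # max = largest finite value, eps = machine epsilon, tiny = smallest normal
    print(f"{dt}: max={fi.max:.3e}  eps={fi.eps:.1e}  smallest normal={fi.tiny:.1e}")

# torch.float16:  max=6.550e+04  eps=9.8e-04  smallest normal=6.1e-05
# torch.bfloat16: max=3.390e+38  eps=7.8e-03  smallest normal=1.2e-38
# The FP8 dtypes (torch.float8_e4m3fn, torch.float8_e5m2) ship only with recent
# PyTorch releases and can be inspected the same way.
```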
BF16 has effectively replaced FP16 for training because its dynamic range matches FP32, eliminating the underflow problem that plagued FP16. The cost is reduced precision: 7 mantissa bits give only two to three significant decimal digits, so accumulated rounding error in long reductions can hurt convergence. This is why reductions (gradient all-reduce, softmax sum, layer-norm variance) are typically done in FP32 even when the operands are BF16.
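The effect is easy to reproduce: accumulating many small addends in BF16 stalls once the running sum dwarfs each addend, whereas an FP32 accumulator does not. A small sketch (the addend of $10^{-3}$ is hypothetical; the exact stall point depends on its size):

```python
import torch

vals = torch.full((10_000,), 1e-3, dtype=torch.bfloat16)

acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in vals:
    acc = acc + v             # every partial sum is rounded back to BF16
print(acc)                    # ~0.5: additions stop registering once acc >> addend

print(vals.float().sum())     # ~9.99 with an FP32 accumulator (the intended sum is 10.0)
```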
For FP16 specifically, loss scaling is mandatory. Gradients in transformer training routinely fall below FP16's smallest normal of $6 \times 10^{-5}$, where they would underflow to zero. The fix is to multiply the loss by a large scalar $S$ before backward, scaling all gradients up by $S$, then divide them back down before the optimiser step:
$$\tilde{g} = \nabla(S \cdot \mathcal{L}), \qquad \theta \leftarrow \theta - \eta \cdot \tilde{g} / S.$$
Dynamic loss scaling (used by PyTorch's GradScaler) starts with a large $S$, halves it (and skips that optimiser step) whenever a NaN or Inf appears in the gradients, and doubles it again after a fixed number of clean steps. BF16 needs no loss scaling because its range covers gradients down to $10^{-38}$.
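The standard PyTorch pattern ties autocast and the scaler together. A sketch assuming a CUDA device, with a stand-in model, optimiser, and synthetic data; any real training loop uses the same three scaler calls:

```python
import torch

# Stand-ins for a real model and data pipeline.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(32, 1024, device="cuda"),
           torch.randn(32, 1024, device="cuda")) for _ in range(10)]

scaler = torch.cuda.amp.GradScaler()        # torch.amp.GradScaler("cuda") in newer releases

for x, y in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()           # backward runs on S * loss
    scaler.step(optimizer)                  # unscales gradients, skips the step on Inf/NaN
    scaler.update()                         # grows or shrinks S based on overflow history
```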
The master weights trick keeps a full FP32 copy of $\theta$ alongside the FP16/BF16 working copy. Optimiser updates $\theta_\mathrm{FP32} \leftarrow \theta_\mathrm{FP32} - \eta g_\mathrm{FP32}$ happen in FP32, then the result is cast back to BF16 for the next forward pass. This avoids the situation where $\eta g$ is too small relative to $\theta$ in BF16 to register as a change at all (a swamping problem when 7-bit mantissas meet small learning rates).
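A scalar sketch of the failure and the fix (the update size of $10^{-3}$ is hypothetical but typical of $\eta\,g$ for a parameter near 1):

```python
import torch

eta_g = torch.tensor(1e-3)                   # a single parameter's update, eta * g

# Naive BF16 update: the update is less than half the spacing of BF16 values
# near 1.0, so every subtraction rounds straight back to 1.0.
theta_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
for _ in range(100):
    theta_bf16 = theta_bf16 - eta_g.to(torch.bfloat16)
print(theta_bf16)                            # tensor(1., dtype=torch.bfloat16)

# Master weights: accumulate in FP32, cast down only for the next forward pass.
theta_fp32 = torch.tensor(1.0)
for _ in range(100):
    theta_fp32 -= eta_g
print(theta_fp32, theta_fp32.to(torch.bfloat16))   # ~0.9000 in FP32, ~0.8984 after the BF16 cast
```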
FP8 training is the active frontier (NVIDIA H100, Blackwell). It requires per-tensor or per-block scaling factors that are recomputed each step, so different parts of the network can use different effective exponent ranges. Empirical work shows FP8 training matches BF16 loss curves on transformers up to hundreds of billions of parameters, with another roughly 2× throughput gain, making it an increasingly common default for new training runs on H100/B200 hardware.
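Per-tensor scaling in isolation looks like the following sketch in plain PyTorch (the float8 dtypes require a recent PyTorch build, and the helper name is illustrative; production FP8 training additionally fuses the scaled cast into the matmul kernels, which is not shown here):

```python
import torch

E4M3_MAX = 448.0                              # largest finite value in FP8 E4M3

def quantize_per_tensor(x: torch.Tensor):
    """Choose a per-tensor scale so the largest |x| lands at the top of the FP8 range."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)   # requires PyTorch >= 2.1
    return x_fp8, scale

x = torch.randn(4, 4)
x_fp8, scale = quantize_per_tensor(x)
x_back = x_fp8.to(torch.float32) / scale          # dequantise to inspect the error
print((x - x_back).abs().max())                   # small, bounded per-tensor quantisation error
```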
Related terms: Quantisation, Distributed Data Parallel, Fully Sharded Data Parallel, Adam
Discussed in:
- Chapter 15: Modern AI, Engineering at Scale