ZeRO (Zero Redundancy Optimiser, Rajbhandari et al., 2020) is the DeepSpeed family of memory-saving partitioning strategies that shard data-parallel training state across workers without changing the mathematics of training. Standard data parallelism replicates everything; ZeRO observes that most of this replication is redundant, since each worker need only own and update a slice of the state, gathering the rest on demand. ZeRO turns replication into partitioning, in three progressive stages.
ZeRO-1: optimiser-state partitioning. The optimiser states (Adam's first moment $m$, second moment $v$, and the FP32 master weights) dominate memory in mixed-precision training: for Adam they total $12P$ bytes, versus only $4P$ for the BF16 parameters and gradients combined. ZeRO-1 shards the optimiser states into $N$ pieces, so each worker stores $12P/N$ optimiser bytes and updates only its own slice. After the optimiser step, an all-gather broadcasts the updated parameter shards back to all workers. Memory drops by roughly $4\times$ for Adam-trained models, with no extra communication beyond that all-gather.
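A back-of-envelope calculation makes the $4\times$ figure concrete. The sketch below is plain Python with no framework dependency; the model size and worker count are illustrative, and it uses the usual 2-byte BF16 / 4-byte FP32 accounting from the paragraph above.

```python
# Per-worker memory under DDP versus ZeRO-1 for Adam + BF16 mixed precision.
# Purely arithmetic illustration; not a measurement of any real framework.

def per_worker_bytes_zero1(P: int, N: int) -> dict:
    """P = parameter count, N = number of data-parallel workers."""
    params_bf16 = 2 * P            # replicated on every worker
    grads_bf16 = 2 * P             # replicated (ZeRO-1 does not shard gradients)
    optim_fp32 = 12 * P // N       # FP32 master weights + Adam m, v (4P each), sharded
    return {
        "params": params_bf16,
        "grads": grads_bf16,
        "optimizer": optim_fp32,
        "total": params_bf16 + grads_bf16 + optim_fp32,
    }

if __name__ == "__main__":
    P, N = 7_000_000_000, 64                  # e.g. a 7B-parameter model on 64 GPUs
    ddp_total = (2 + 2 + 12) * P              # everything replicated: 16P bytes
    z1 = per_worker_bytes_zero1(P, N)
    print(f"DDP:    {ddp_total / 2**30:.0f} GiB per worker")
    print(f"ZeRO-1: {z1['total'] / 2**30:.0f} GiB per worker")  # ~4x smaller at large N
```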
ZeRO-2: gradient partitioning. Beyond optimiser states, ZeRO-2 also shards gradients. Instead of an all-reduce that lands the full averaged gradient on every worker, it uses a reduce-scatter so worker $i$ receives only the slice $\bar{g}_i$ it needs to update its optimiser shard. Gradient memory shrinks from $2P$ to $2P/N$ bytes in BF16, and total communication volume is unchanged: one reduce-scatter plus one all-gather moves the same data as one all-reduce.
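To make the collective pattern concrete, here is a minimal sketch of a ZeRO-2-style step written against raw torch.distributed collectives. It is an illustration, not DeepSpeed's implementation: it assumes an initialized NCCL process group on a recent PyTorch, a flat parameter buffer whose length divides evenly by the world size, and plain SGD standing in for the sharded Adam states a real run would keep.

```python
import torch
import torch.distributed as dist

def sharded_step(flat_params: torch.Tensor, flat_grads: torch.Tensor, lr: float = 1e-3) -> None:
    world = dist.get_world_size()
    rank = dist.get_rank()
    shard = flat_params.numel() // world

    # Reduce-scatter: worker i receives only its slice of the summed gradient.
    grad_shard = torch.empty(shard, dtype=flat_grads.dtype, device=flat_grads.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.SUM)
    grad_shard /= world  # average across data-parallel workers

    # Local update of this worker's parameter shard only.
    param_shard = flat_params[rank * shard:(rank + 1) * shard]
    param_shard -= lr * grad_shard.to(param_shard.dtype)

    # All-gather the updated shards so every worker starts the next step with
    # the full parameters. Reduce-scatter + all-gather moves the same volume
    # as the single all-reduce DDP would have issued.
    dist.all_gather_into_tensor(flat_params, param_shard.contiguous())
```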
ZeRO-3: parameter partitioning. The most aggressive stage shards the parameters themselves, mirroring the design of PyTorch's FSDP. Each worker permanently holds only $P/N$ parameters; before any layer's forward pass, an all-gather assembles that layer's full parameters, which are discarded again after use. Memory per worker scales as $O(P/N)$ for parameters, gradients, and optimiser states combined, which, together with the offload extensions described below, is what enables 100B+ parameter models to train on commodity 8-GPU nodes.
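The gather-use-discard cycle can be illustrated with a forward-only sketch over a list of linear layers. This is illustrative only and nothing like DeepSpeed's actual implementation, which hooks into autograd and prefetches the next layer's gather to overlap communication with compute; it assumes an initialized process group, per-layer weight shards padded to divide evenly by the world size, and it ignores biases and the backward pass.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def zero3_forward(layers: list[nn.Linear], shards: list[torch.Tensor], x: torch.Tensor) -> torch.Tensor:
    """layers: modules whose weight storage is released between uses;
    shards: this worker's 1/N slice of each layer's flattened (padded) weight."""
    world = dist.get_world_size()
    for layer, shard in zip(layers, shards):
        # 1. All-gather the full flattened weight for this layer from all workers.
        full = torch.empty(shard.numel() * world, dtype=shard.dtype, device=shard.device)
        dist.all_gather_into_tensor(full, shard)
        layer.weight.data = full[: layer.out_features * layer.in_features].view(
            layer.out_features, layer.in_features
        )
        # 2. Run the layer while its weights are materialised.
        x = layer(x)
        # 3. Discard the gathered weights; only the 1/N shard persists.
        layer.weight.data = torch.empty(0, dtype=shard.dtype, device=shard.device)
    return x
```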
The communication–memory trade-off is precise. ZeRO-1 and ZeRO-2 keep the same communication volume as DDP. ZeRO-3 raises per-step communication by roughly $1.5\times$ relative to DDP: about $3P$ values moved per worker versus $2P$, because parameters must be gathered once for the forward pass and once for the backward pass on top of the gradient reduce-scatter (with activation recomputation a third gather is needed, bringing the ratio to roughly $2\times$). The bandwidth penalty is the price of fitting the model.
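The ratios can be tallied directly; the small helper below counts per-worker traffic in multiples of the model size in wire precision, matching the counts in the paragraph above (illustrative bookkeeping only).

```python
def traffic(strategy: str, recompute: bool = False) -> int:
    """Per-worker traffic per step, in multiples of the model size."""
    if strategy == "ddp":
        return 2                                  # ring all-reduce of gradients ~ 2x model size
    if strategy in ("zero1", "zero2"):
        return 1 + 1                              # grad reduce-scatter + param all-gather = 2x
    if strategy == "zero3":
        param_gathers = 3 if recompute else 2     # forward, backward, (+ recomputation)
        return param_gathers + 1                  # plus the gradient reduce-scatter
    raise ValueError(strategy)

print(traffic("ddp"), traffic("zero3"), traffic("zero3", recompute=True))  # -> 2 3 4
```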
ZeRO can be combined with CPU offload (ZeRO-Offload) and NVMe offload (ZeRO-Infinity), which push optimiser states, and optionally parameters, to host RAM or SSD respectively. ZeRO-Infinity demonstrated fitting models with over a trillion parameters on a single DGX-2 node, and tens of trillions across a cluster, by streaming parameters from NVMe, a regime where I/O bandwidth, not compute, sets the step time.
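A representative configuration shows how the stages and offload targets are selected in practice. The field names below follow the public DeepSpeed config schema, but the exact options and defaults depend on the installed version, and the model, batch size, learning rate, and NVMe path are placeholders; a real run is launched with the `deepspeed` launcher.

```python
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,                                                      # parameter partitioning
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}

model = torch.nn.Linear(4096, 4096)  # stand-in; a real run would build a large transformer
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# engine.forward / engine.backward / engine.step then run the ZeRO-3 + offload
# schedule described above.
```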
ZeRO is mathematically identical to standard data-parallel SGD: the loss, gradients, and updates are the same; only the placement of state differs. This is what makes it a drop-in replacement, and what distinguishes it from pipeline parallelism and tensor parallelism, which restructure the computation itself.
Related terms: Fully Sharded Data Parallel, Distributed Data Parallel, Mixed Precision Training, Tensor Parallelism, Pipeline Parallelism
Discussed in:
- Chapter 15: Modern AI, Engineering at Scale