Glossary

Distributed Data Parallel

Distributed Data Parallel (DDP) is the canonical data-parallel training strategy in PyTorch. Each of $N$ worker devices holds a full replica of the model parameters $\theta$ and processes a disjoint shard of the global mini-batch $B$. After the local backward pass, each worker holds a local gradient $g_i = \nabla_\theta \mathcal{L}_i$, and an all-reduce collective synchronises them so that every worker ends the step with the averaged gradient

$$g = \frac{1}{N} \sum_{i=1}^N g_i.$$

Because all replicas start from the same initial weights and apply the same averaged gradient under the same optimiser state, parameter consistency is maintained without ever broadcasting weights; only gradients move on the wire.
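
Conceptually, the synchronisation step can be written out by hand. The sketch below is illustrative rather than DDP's actual implementation (which fuses this into bucketed, asynchronous collectives, as described next) and assumes a process group has already been initialised on every rank:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers after loss.backward().

    Assumes dist.init_process_group() has already been called on every rank.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the local gradients g_i from every rank in place...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide by N so every rank holds g = (1/N) * sum_i g_i.
            param.grad.div_(world_size)
```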

Bucket-based communication is the key engineering trick. Naively launching one all-reduce per parameter tensor would incur thousands of small collectives per step, dominated by latency. DDP groups parameters into contiguous buckets (default 25 MB) and fires an asynchronous all-reduce on a bucket as soon as every gradient inside it is ready. This overlaps communication with the backward pass: while later layers are still computing gradients, earlier layers' buckets are already on the network. The hooks are registered on the autograd graph during construction, so no user code is needed beyond wrapping the model in DistributedDataParallel.
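
A minimal sketch of the usual setup under torchrun is shown below; the helper name is an illustration, but `device_ids` and `bucket_cap_mb` are the real constructor arguments, the latter controlling the bucket size discussed above:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module) -> DDP:
    # torchrun launches one process per GPU and sets LOCAL_RANK, RANK
    # and WORLD_SIZE in the environment for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Gradient hooks and 25 MB buckets are set up here; larger buckets mean
    # fewer collectives, smaller buckets start overlapping with the backward
    # pass sooner.
    return DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
```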

The communication time per step for a gradient payload of $P$ bytes under ring all-reduce is

$$T_\mathrm{comm} \approx 2 \cdot \frac{P (N-1)}{N \cdot B_\mathrm{net}},$$

where $B_\mathrm{net}$ is the per-link bandwidth. The factor of two reflects the reduce-scatter and all-gather phases of ring all-reduce; the data volume per worker is approximately $2P$ regardless of $N$, which is why DDP scales well on bandwidth-rich interconnects like NVLink and InfiniBand.
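
As a rough illustration (the helper and the bandwidth figure below are assumptions, not measurements), the bandwidth-only term can be evaluated for a 7B-parameter FP32 gradient payload:

```python
def ring_allreduce_seconds(payload_bytes: float, n_workers: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time; latency is ignored."""
    return 2 * payload_bytes * (n_workers - 1) / (n_workers * bandwidth_bytes_per_s)

# 7e9 FP32 gradients ~ 28 GB, 8 workers, assumed ~25 GB/s effective per link.
print(ring_allreduce_seconds(28e9, 8, 25e9))  # ~1.96 s of pure communication
```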

DDP's memory footprint is the same as single-GPU training: a full copy of parameters, gradients, optimiser states (e.g. Adam's first and second moments, which add roughly twice the parameter memory again) and activations on every device. For a 7B-parameter model in FP32 the weights alone are roughly 28 GB, with another 28 GB for gradients, before optimiser states or activations. This is the central limitation that motivated FSDP and ZeRO.
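
A back-of-the-envelope helper (a sketch, assuming FP32 everywhere and ignoring activations and framework overhead) makes the replication cost concrete:

```python
def ddp_state_gb(n_params: float, bytes_per_value: int = 4, adam: bool = True) -> float:
    """Per-device memory for parameters, gradients and optimiser state under DDP."""
    copies = 2              # parameters + gradients
    if adam:
        copies += 2         # Adam's first and second moments
    return n_params * bytes_per_value * copies / 1e9

print(ddp_state_gb(7e9))    # ~112 GB: 28 (weights) + 28 (grads) + 56 (Adam)
```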

DDP does not shard anything; it is pure replication. It assumes the model fits on one device and uses extra devices only to consume more data per step. The effective batch size scales linearly with $N$, which interacts with optimiser hyperparameters: the linear-scaling rule sets the learning rate to $\eta \cdot N$ and uses a warm-up to stabilise the larger updates. Gradient accumulation can simulate even larger effective batches by averaging gradients over $k$ micro-batches before the all-reduce, trading wall-clock time for effective $B$.
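
The sketch below shows one way to implement gradient accumulation under DDP (the function and its arguments are illustrative); `no_sync()` suppresses the all-reduce on the first $k-1$ micro-batches so gradients accumulate locally and are synchronised only once per optimiser step:

```python
import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulated_step(model: DDP, optimiser: torch.optim.Optimizer,
                     micro_batches, loss_fn, k: int) -> None:
    """One optimiser step built from k micro-batches."""
    optimiser.zero_grad()
    for step, (inputs, targets) in enumerate(micro_batches):
        # Skip the all-reduce on all but the last micro-batch.
        ctx = model.no_sync() if step < k - 1 else contextlib.nullcontext()
        with ctx:
            loss = loss_fn(model(inputs), targets) / k  # average over micro-batches
            loss.backward()
    # The last backward() ran the bucketed all-reduce, so every rank now
    # holds the same averaged gradient.
    optimiser.step()
```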

DDP is the right default whenever a model fits on one accelerator. Beyond that point you either need to shard state (FSDP, ZeRO) or to split the model itself across devices (tensor parallelism, pipeline parallelism).

Related terms: Fully Sharded Data Parallel, ZeRO, Tensor Parallelism, Pipeline Parallelism, Mixed Precision Training
