Glossary

Pipeline Parallelism

Pipeline parallelism partitions a deep network along its depth axis, placing a contiguous group of layers (a stage) on each device. Activations flow forward stage-by-stage and gradients flow backward in the reverse order, much as instructions move through a CPU pipeline. This is orthogonal to DDP (which replicates all layers on every device) and to tensor parallelism (which splits individual matrix multiplies); the three can be composed into a 3D parallel mesh.
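
A minimal sketch of this partitioning, assuming PyTorch, four GPUs, and illustrative layer sizes (not a recommended implementation):

```python
import torch
import torch.nn as nn

# Partition a layer list into contiguous stages, one per device.
# The 16-layer model, 4-stage split, and device names are illustrative.
layers = [nn.Linear(1024, 1024) for _ in range(16)]
num_stages = 4
per_stage = len(layers) // num_stages

stages = []
for s in range(num_stages):
    block = nn.Sequential(*layers[s * per_stage:(s + 1) * per_stage])
    stages.append(block.to(f"cuda:{s}"))

def forward_pipeline(x):
    # Activations flow forward stage by stage; each hop moves only the
    # activation tensor across devices, never the parameters.
    for s, stage in enumerate(stages):
        x = stage(x.to(f"cuda:{s}"))
    return x
```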

The naive scheme, feeding one mini-batch through stage 1, then stage 2, then stage 3, leaves all but one device idle at any moment, which defeats the purpose of using multiple devices. The fix is to split the mini-batch into $m$ micro-batches so that stages can work concurrently on different micro-batches. GPipe (Huang et al., 2019) introduced this with a synchronous schedule: all forward passes for all micro-batches first, then all backward passes, then one weight update. PipeDream (Narayanan et al., 2019) interleaves forward and backward passes in a 1F1B (one-forward, one-backward) schedule that holds at most $P$ in-flight micro-batches per stage, where $P$ is the number of stages. 1F1B has the same throughput as GPipe but uses far less activation memory because backward passes free activations promptly.
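
The difference between the two schedules is easiest to see as the order of operations a single stage executes. The sketch below reconstructs that order from the description above; it is an illustrative reconstruction, not any library's scheduler:

```python
# 'm' micro-batches, 'P' stages, 'rank' is the stage index (0 = first stage).

def gpipe_schedule(m):
    # All forwards first, then all backwards: peak of m live activations.
    return [("F", i) for i in range(m)] + [("B", i) for i in reversed(range(m))]

def one_f_one_b_schedule(m, P, rank):
    warmup = min(P - rank - 1, m)        # forwards done before the first backward
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    # Steady state: alternate one forward with one backward.
    while f < m:
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    # Cooldown: drain the remaining backwards.
    while b < m:
        ops.append(("B", b)); b += 1
    return ops

print(gpipe_schedule(8))                 # peak of 8 activations held at once
print(one_f_one_b_schedule(8, 4, 0))     # peak of ~P activations held at once
```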

Even with $m$ micro-batches the pipeline cannot run at 100% utilisation, because later stages sit idle while the pipe fills at the start of each step and earlier stages sit idle while it drains at the end. The fraction of time wasted in these bubbles is

$$\frac{T_\mathrm{bubble}}{T_\mathrm{total}} = \frac{P - 1}{m + P - 1} \approx \frac{P - 1}{m},$$

so the bubble shrinks as the number of micro-batches grows. The standard rule of thumb is $m \geq 4P$, which keeps the bubble below 20%. Interleaved 1F1B (used in Megatron-LM) further reduces the bubble by assigning each device several non-contiguous "virtual" stages, raising the effective micro-batch count from $m$ to $m \cdot v$ where $v$ is the virtual-stage multiplier.
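
A quick numerical check of the formula and the $m \geq 4P$ rule of thumb; the interleaved variant below simply substitutes $m \cdot v$ for $m$ as described above:

```python
def bubble_fraction(P, m, v=1):
    # v is the virtual-stage multiplier for interleaved 1F1B (v = 1 means plain 1F1B).
    return (P - 1) / (m * v + P - 1)

print(bubble_fraction(P=8, m=32))        # m = 4P  -> ~0.18, just under 20%
print(bubble_fraction(P=8, m=32, v=2))   # interleaving shrinks the bubble further
```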

Pipeline parallelism couples cleanly with activation checkpointing: each stage stores only its input activations and recomputes intermediates during the backward pass. Memory per stage is then $O(L/P + m)$ activations rather than the $O((L/P) \cdot m)$ that an all-forward-then-all-backward schedule would need without checkpointing, where $L$ is the total layer count.
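
As an illustration, a stage can wrap its layers in a single checkpointed region so that only the stage input is saved; this sketch assumes PyTorch's `torch.utils.checkpoint` and uses placeholder layer sizes:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        def run_all(inp):
            for layer in self.layers:
                inp = layer(inp)
            return inp
        # Only x (the stage input) is stored; everything inside run_all is
        # recomputed during the backward pass.
        return checkpoint(run_all, x, use_reentrant=False)

stage = CheckpointedStage([nn.Linear(1024, 1024) for _ in range(4)])
out = stage(torch.randn(8, 1024, requires_grad=True))
out.sum().backward()
```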

The principal difficulty is load balancing. If per-layer compute is uneven (e.g. the embedding and output projection are tiny while transformer blocks are large), naive equal-layer partitioning leaves every stage waiting on the slowest. Practitioners therefore profile per-layer FLOPs and assign layers to stages so that wall-clock time per stage is roughly equal, sometimes putting only the embedding on stage 0 and merging more blocks onto later stages.
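
One simple way to do this is to binary-search the smallest per-stage cost cap that still yields at most $P$ contiguous stages. The sketch below uses made-up per-layer costs; in practice the costs would come from profiling:

```python
def greedy_split(costs, limit):
    """Fill stages left to right, starting a new stage when 'limit' would be exceeded."""
    bounds, start, current = [], 0, 0.0
    for i, c in enumerate(costs):
        if current + c > limit and i > start:
            bounds.append((start, i))
            start, current = i, 0.0
        current += c
    bounds.append((start, len(costs)))
    return bounds

def partition(costs, P):
    lo, hi = max(costs), sum(costs)
    for _ in range(50):                      # bisect the feasible per-stage cost cap
        mid = (lo + hi) / 2
        if len(greedy_split(costs, mid)) <= P:
            hi = mid
        else:
            lo = mid
    return greedy_split(costs, hi)

# Tiny embedding, twelve heavy transformer blocks, tiny output projection.
costs = [0.1] + [1.0] * 12 + [0.1]
print(partition(costs, P=4))   # stage boundaries chosen to equalise per-stage cost
```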

Pipeline parallelism shines when inter-device bandwidth is low relative to compute. Only activations and gradients of activations cross a stage boundary, not parameters or full parameter gradients, so the per-step communication volume is much smaller than tensor parallelism's. This makes pipeline parallelism the parallelism of choice across slower inter-node links (e.g. Ethernet or InfiniBand between nodes), with tensor parallelism and FSDP reserved for fast intra-node links.
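
A back-of-the-envelope comparison makes the point. The model size, sequence length, and micro-batch count below are illustrative assumptions, and the factor of roughly twice the gradient bytes for DDP assumes a ring all-reduce:

```python
hidden, seq, micro_batch, m = 8192, 2048, 1, 64
params = 30e9                      # parameters held by one data-parallel replica
bytes_per_elem = 2                 # fp16/bf16

# Pipeline: each stage boundary carries one activation tensor per micro-batch,
# in both directions (activation forward, activation-gradient backward).
pipeline_bytes = 2 * m * micro_batch * seq * hidden * bytes_per_elem

# DDP: every step all-reduces a full gradient copy (~2x the gradient bytes moved).
ddp_bytes = 2 * params * bytes_per_elem

print(f"pipeline boundary: {pipeline_bytes / 1e9:.1f} GB/step")
print(f"DDP all-reduce:    {ddp_bytes / 1e9:.1f} GB/step")
```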

Related terms: Tensor Parallelism, Distributed Data Parallel, Fully Sharded Data Parallel, ZeRO, Transformer
