10.13 Pipeline, tensor and expert parallelism: the bubble overhead

It is worth quantifying the costs.

Pipeline bubble

With $K$ stages and $M$ micro-batches, a forward-only pipeline finishes in $(M + K - 1)\, t_{\mathrm{stage}}$: the first micro-batch takes $K\, t_{\mathrm{stage}}$ to traverse all stages, after which one micro-batch completes every $t_{\mathrm{stage}}$. The ideal runtime with no bubble is $M K\, t_{\mathrm{stage}} / K = M\, t_{\mathrm{stage}}$, since the $MK$ stage-executions of useful work are spread across $K$ stages running in parallel. Efficiency is

$$\eta_{\mathrm{pipe}} = \frac{M}{M + K - 1}.$$

To get $\eta_{\mathrm{pipe}} = 0.9$ with $K = 16$ stages we need $M = 135$ micro-batches per global batch (solve $M/(M+15) = 0.9$). The activation memory cost is correspondingly high: in a forward-only schedule we must store activations for all $M$ in-flight micro-batches at every stage.
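The bubble arithmetic above is easy to check numerically. A toy calculation (not tied to any framework):

```python
def pipeline_efficiency(M, K):
    """Fraction of pipeline time spent on useful work: M / (M + K - 1)."""
    return M / (M + K - 1)

def microbatches_for_efficiency(eta, K):
    """Smallest M achieving at least efficiency eta with K stages."""
    M = 1
    while pipeline_efficiency(M, K) < eta:
        M += 1
    return M

print(microbatches_for_efficiency(0.9, 16))  # 135
```

The linear search avoids floating-point surprises from solving $M = \eta(K-1)/(1-\eta)$ directly and then rounding.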

Tensor parallelism communication

Each forward attention block needs one all-reduce on activations of $B \cdot L \cdot d_{\mathrm{model}}$ elements (with $B$ the batch size, $L$ the sequence length). With ring all-reduce across $K$ GPUs on NVLink (here $K$ is the tensor-parallel degree), this takes roughly $\frac{2(K-1)}{K}\, \frac{s}{\mathrm{bw}}$ seconds, where $s$ is the message size in bytes. For $B L d = 10^6$ elements in BF16 ($s = 2$ MB) and NVLink at 600 GB/s, this is $\approx 6$ μs: negligible per layer, but it accumulates over 96 layers.
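Plugging in the numbers from the text, a minimal sketch of this bandwidth estimate (the 8-way degree is an illustrative assumption; this models only the bandwidth term, ignoring latency):

```python
def ring_allreduce_seconds(n_elements, bytes_per_elem, n_gpus, bw_bytes_per_s):
    """Bandwidth term of a ring all-reduce: each GPU moves 2*(K-1)/K of the buffer."""
    msg_bytes = n_elements * bytes_per_elem
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes / bw_bytes_per_s

# B*L*d = 1e6 elements, BF16 (2 bytes/elem), 8-way TP, 600 GB/s NVLink
t = ring_allreduce_seconds(1_000_000, 2, 8, 600e9)
print(f"{t * 1e6:.1f} us per all-reduce, {t * 96 * 1e6:.0f} us over 96 layers")
```

This lands at roughly 6 μs per all-reduce, matching the estimate above; over 96 layers the total is still well under a millisecond per forward pass.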

Expert parallelism all-to-all

The all-to-all moves $B \cdot L \cdot d_{\mathrm{model}}$ activation elements ($B \cdot L$ tokens, each of width $d_{\mathrm{model}}$) per forward pass. Cross-node all-to-all is one of the most expensive primitives in distributed training and is the chief bottleneck for very large MoE models.
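A rough sketch of the cost, in the same style as the all-reduce estimate. All the concrete numbers here (tokens per rank, $d_{\mathrm{model}}$, rank count, per-GPU bandwidth) are illustrative assumptions, and only the bandwidth term is modeled:

```python
def all_to_all_seconds(n_tokens, d_model, bytes_per_elem, n_ranks, bw_bytes_per_s):
    """Bandwidth term of an all-to-all: each rank sends (n-1)/n of its buffer remotely."""
    local_bytes = n_tokens * d_model * bytes_per_elem
    return (n_ranks - 1) / n_ranks * local_bytes / bw_bytes_per_s

# Assumed numbers: 4096 tokens/rank, d_model = 8192, BF16,
# 64 ranks over 200 Gb/s (= 25 GB/s) InfiniBand per GPU
t = all_to_all_seconds(4096, 8192, 2, 64, 25e9)
print(f"{t * 1e3:.2f} ms per all-to-all")
```

Note the result is milliseconds, not microseconds: three orders of magnitude above the per-layer NVLink all-reduce, which is why the all-to-all dominates.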

Practical heuristic

Choose parallelism axes by communication intensity vs interconnect bandwidth:

  1. Tensor parallelism within a node (NVLink, 600 GB/s+).
  2. Pipeline parallelism across nodes within a rack (InfiniBand, 200 Gb/s).
  3. Data parallelism across racks.
  4. Expert parallelism only when you have very fast all-to-all (modern InfiniBand or NVL-rack).
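The heuristic above can be written down as a parallelism layout. The degrees below describe a hypothetical cluster (8 GPUs per node, 4 nodes per rack, 8 racks) and are illustrative, not prescriptive:

```python
import math

# Hypothetical cluster: 8 GPUs/node (NVLink), 4 nodes/rack (InfiniBand), 8 racks.
layout = {
    "tensor":   8,  # most communication-intensive axis on the fastest interconnect
    "pipeline": 4,  # across nodes within a rack
    "data":     8,  # least chatty axis on the slowest links, across racks
}
world_size = math.prod(layout.values())
print(world_size)  # 256 GPUs total
```

The world size is simply the product of the axis degrees, so budgeting starts from the GPU count and factors it along the axes in order of communication intensity.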
