10.13 Pipeline, tensor and expert parallelism: the bubble overhead
It is worth quantifying the costs.
Pipeline bubble
With $K$ stages and $M$ micro-batches, a forward-only pipeline takes $(M + K - 1)\, t_{\mathrm{stage}}$ in total, but each stage does useful work for only $M\, t_{\mathrm{stage}}$ of that time (the total useful compute $M K\, t_{\mathrm{stage}}$ divided across $K$ stages). Efficiency is therefore
$$\eta_{\mathrm{pipe}} = \frac{M}{M + K - 1}.$$
To reach $\eta_{\mathrm{pipe}} = 0.9$ with $K = 16$ stages we need $M = 135$ micro-batches per global batch. The activation memory cost is correspondingly high: in the naive (GPipe-style) schedule, every stage must hold activations for all $M$ in-flight micro-batches (1F1B-style schedules cap this at $K$ or fewer per stage).
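A minimal Python sketch of these formulas (the function names are ours, for illustration only):

```python
# Pipeline-bubble back-of-the-envelope: efficiency and required micro-batch count.
def pipe_efficiency(num_microbatches: int, num_stages: int) -> float:
    """eta = M / (M + K - 1) for a forward-only pipeline."""
    return num_microbatches / (num_microbatches + num_stages - 1)

def microbatches_for(target_eff: float, num_stages: int) -> float:
    """Invert the formula: M = eta * (K - 1) / (1 - eta)."""
    return target_eff * (num_stages - 1) / (1.0 - target_eff)

print(pipe_efficiency(135, 16))   # 0.9
print(microbatches_for(0.9, 16))  # 135.0
```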
Tensor parallelism communication
Each transformer layer's attention block needs one all-reduce on activations of size $B \cdot L \cdot d_{\mathrm{model}}$ in the forward pass (with $B$ the batch size, $L$ the sequence length), and the MLP block adds a second. With ring all-reduce over $T$ tensor-parallel ranks on NVLink, one such all-reduce takes roughly $\frac{2(T-1)}{T}\, \frac{2\, B L d}{\mathrm{bw}}$ seconds in BF16 (2 bytes per element). For $B L d = 10^6$ and NVLink at 600 GB/s this is $\approx 6$ μs: negligible per layer, but it accumulates over 96 layers.
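The same estimate in code, assuming a hypothetical 8-way tensor-parallel group (all names and sizes are illustrative):

```python
# Rough ring all-reduce timing for one tensor-parallel activation all-reduce.
# Model: each rank moves ~2*(T-1)/T of the message over the ring.
def ring_allreduce_seconds(num_elements: int, bytes_per_elem: int,
                           num_ranks: int, bandwidth_bytes_per_s: float) -> float:
    message_bytes = num_elements * bytes_per_elem
    return 2 * (num_ranks - 1) / num_ranks * message_bytes / bandwidth_bytes_per_s

# B*L*d = 1e6 activations in BF16 (2 bytes), 8-way TP, 600 GB/s NVLink:
t = ring_allreduce_seconds(1_000_000, 2, 8, 600e9)
print(f"{t * 1e6:.1f} us")  # ~5.8 us per all-reduce
```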
Expert parallelism all-to-all
Each MoE layer's all-to-all moves $B \cdot L$ tokens of $d_{\mathrm{model}}$ activations each (i.e. $B \cdot L \cdot d_{\mathrm{model}}$ values), once to dispatch tokens to their experts and once to combine the results; top-$k$ routing multiplies the volume by $k$. Cross-node all-to-all is one of the most expensive primitives in distributed training and is the chief bottleneck for very large MoE models.
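A back-of-the-envelope volume and timing sketch, under the crude assumption that each of $E$ expert-parallel ranks ships $(E-1)/E$ of its tokens off-device at the per-rank link bandwidth (all numbers below are made up for illustration):

```python
# Crude MoE all-to-all estimate: dispatch volume per rank and a naive timing model.
def alltoall_bytes(batch: int, seq_len: int, d_model: int,
                   top_k: int = 2, bytes_per_elem: int = 2) -> int:
    """Activation bytes dispatched by one rank per MoE layer (BF16, top-k routing)."""
    return batch * seq_len * top_k * d_model * bytes_per_elem

def alltoall_seconds(volume_bytes: int, num_ranks: int,
                     link_bytes_per_s: float) -> float:
    """Assume (E-1)/E of the volume crosses the link; ignore overlap and congestion."""
    return volume_bytes * (num_ranks - 1) / num_ranks / link_bytes_per_s

vol = alltoall_bytes(batch=8, seq_len=4096, d_model=8192)          # ~1.1 GB
print(alltoall_seconds(vol, num_ranks=64, link_bytes_per_s=25e9))  # ~0.04 s per layer
```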
Practical heuristic
Choose parallelism axes by matching communication intensity to interconnect bandwidth (a toy layout sketch follows the list):
- Tensor parallelism within a node (NVLink, 600 GB/s+).
- Pipeline parallelism across nodes within a rack (InfiniBand, 200 Gb/s).
- Data parallelism across racks.
- Expert parallelism only when you have very fast all-to-all (modern InfiniBand or NVL-rack).
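A minimal sketch of this heuristic as a cluster layout, assuming a hypothetical 8-GPU-per-node, 4-node-per-rack, 8-rack cluster (all sizes are illustrative):

```python
# Toy mapping of parallelism axes onto interconnect tiers (illustrative only).
def parallelism_layout(gpus_per_node: int = 8, nodes_per_rack: int = 4,
                       num_racks: int = 8) -> dict:
    """Tensor parallel inside the NVLink domain, pipeline parallel across nodes
    within a rack, data parallel across racks."""
    return {
        "tensor_parallel": gpus_per_node,     # NVLink, 600+ GB/s
        "pipeline_parallel": nodes_per_rack,  # intra-rack InfiniBand
        "data_parallel": num_racks,           # cross-rack
        "total_gpus": gpus_per_node * nodes_per_rack * num_racks,
    }

print(parallelism_layout())
# {'tensor_parallel': 8, 'pipeline_parallel': 4, 'data_parallel': 8, 'total_gpus': 256}
```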