Glossary

Tensor Parallelism

Tensor parallelism splits a single layer's computation across multiple devices by partitioning its weight matrices. Where pipeline parallelism cuts the model across the depth axis (different layers on different devices) and DDP cuts across the batch axis, tensor parallelism cuts across the width axis: a 16384×16384 weight matrix becomes two 16384×8192 shards, each living on a different GPU. The technique was popularised by Shoeybi et al.'s Megatron-LM (2019) and is now standard for training and serving large transformer models.

The construction relies on two complementary partitioning patterns. A column-parallel linear layer splits the weight matrix along the output dimension:

$$W = [W_1, W_2], \qquad X W = [X W_1,\ X W_2],$$

so each device computes its own slice of the output independently from a replicated input $X$. No communication is needed for the matmul itself; the outputs naturally end up sharded along the output dimension. A row-parallel linear layer splits along the input dimension:

$$W = \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}, \qquad X = [X_1, X_2], \qquad X W = X_1 W_1 + X_2 W_2,$$

where each device computes a partial sum from its shard of the input, and an all-reduce sums the partials into the final output. Communication happens once per row-parallel layer.
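
Both identities are easy to check numerically. Below is a minimal single-process sketch (assuming PyTorch) in which the two "devices" are simulated with tensor slices; the variable names are illustrative, not any framework's API:

```python
import torch

torch.manual_seed(0)
d_in, d_out, batch = 8, 6, 4
X = torch.randn(batch, d_in)
W = torch.randn(d_in, d_out)

# Column-parallel: split W along the output dimension; X is replicated.
W1, W2 = W.chunk(2, dim=1)                   # two (d_in, d_out/2) shards
Y_col = torch.cat([X @ W1, X @ W2], dim=1)   # each "device" owns one output slice
assert torch.allclose(Y_col, X @ W, atol=1e-5)

# Row-parallel: split W along the input dimension; X arrives sharded.
V1, V2 = W.chunk(2, dim=0)                   # two (d_in/2, d_out) shards
X1, X2 = X.chunk(2, dim=1)                   # matching input shards
Y_row = X1 @ V1 + X2 @ V2                    # the "+" plays the role of the all-reduce
assert torch.allclose(Y_row, X @ W, atol=1e-5)
```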

Megatron's transformer block chains a column-parallel layer feeding into a row-parallel layer, so that the output of the first stays sharded across devices and the input of the second arrives already sharded: only one all-reduce is needed at the end of the pair. The MLP $\mathrm{down}(\sigma(\mathrm{up}(x)))$ uses column-parallel for $\mathrm{up}$ and row-parallel for $\mathrm{down}$. Self-attention is structured similarly: $W_Q, W_K, W_V$ are column-parallel (so each device owns a contiguous subset of attention heads), and $W_O$ is row-parallel. Each transformer block thus incurs two all-reduces per forward pass and two more per backward pass.
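
The single-all-reduce property of the pair works because the activation is elementwise, so it can be applied to an already-sharded tensor. A hedged sketch of the MLP pairing, again simulating the two devices in one process; the GELU choice and the shard layout here are assumptions for illustration, not Megatron-LM's actual module code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, d_ff, batch = 8, 32, 4
x = torch.randn(batch, d)                 # replicated input
W_up = torch.randn(d, d_ff)
W_down = torch.randn(d_ff, d)

up_shards = W_up.chunk(2, dim=1)          # column-parallel: (d, d_ff/2) each
down_shards = W_down.chunk(2, dim=0)      # row-parallel: (d_ff/2, d) each

# Per-device compute: the elementwise GELU acts on the sharded hidden state,
# so no communication is needed until the final sum.
partials = [F.gelu(x @ up) @ down for up, down in zip(up_shards, down_shards)]
y = partials[0] + partials[1]             # the one all-reduce of the pair

y_ref = F.gelu(x @ W_up) @ W_down
assert torch.allclose(y, y_ref, atol=1e-5)
```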

The communication cost is significant. For activations of shape $(B, L, d)$, each all-reduce moves roughly $2 B L d$ bytes (in BF16, at 2 bytes per element), so an $N_l$-layer model incurs $4 N_l$ all-reduces per micro-batch step, each spanning the $T$ devices of the tensor-parallel group. This is why tensor parallelism is almost always confined to intra-node scope: NVLink (e.g. 900 GB/s on H100) makes the all-reduces tolerable, but extending tensor parallelism over inter-node links would crater throughput.
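
To make the volume concrete, a quick back-of-the-envelope calculation with assumed, illustrative shapes (not a benchmark):

```python
# All-reduce volume for one micro-batch, using the formulas from the text.
B, L, d = 4, 4096, 8192                      # micro-batch, sequence length, hidden size
N_layers = 80
bytes_per_allreduce = 2 * B * L * d          # BF16: 2 bytes per element
total = 4 * N_layers * bytes_per_allreduce   # 2 forward + 2 backward per layer
print(f"{bytes_per_allreduce / 1e9:.2f} GB per all-reduce, "
      f"{total / 1e9:.1f} GB per micro-batch step")
# -> 0.27 GB per all-reduce, 85.9 GB per micro-batch step
```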

Tensor parallelism composes naturally with the other axes. The standard 3D parallel recipe used for 100B+ models is: tensor-parallel within a node (degree 8 on an 8-GPU box), pipeline-parallel across small groups of nodes, and data-parallel (FSDP or DDP) across the remaining mesh dimension. Megatron-DeepSpeed's training runs at GPT-3 scale and beyond showed that this 3D mesh is essentially what it takes to keep MFU (model FLOPs utilisation) above 50% as models approach trillion-parameter scale.
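
A sketch of how the mesh arithmetic works out (cluster size and degrees here are assumed for illustration; in Megatron-LM the first two are set via `--tensor-model-parallel-size` and `--pipeline-model-parallel-size`):

```python
# Decomposing a cluster into a 3D mesh (illustrative numbers).
world_size = 1024                 # e.g. 128 nodes x 8 GPUs
tp = 8                            # tensor-parallel, kept intra-node
pp = 8                            # pipeline-parallel, across node groups
dp = world_size // (tp * pp)      # whatever remains is data-parallel
assert tp * pp * dp == world_size
print(f"TP={tp}, PP={pp}, DP={dp}")   # TP=8, PP=8, DP=16
```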

For inference, tensor parallelism reduces per-token latency by parallelising the per-layer matmul itself, which is why systems like vLLM and TensorRT-LLM use it as their default scaling axis for large models.
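
A rough roofline sketch of why this helps during decode, under the simplifying assumption that generating one token reads every weight from HBM exactly once (hardware numbers are nominal, and communication overhead and KV-cache reads are ignored):

```python
# Weight-read floor on per-token decode latency vs. tensor-parallel degree.
params = 70e9                      # model parameters (assumed 70B model)
bytes_per_param = 2                # BF16
hbm_bw = 3.35e12                   # bytes/s per GPU (H100 SXM, nominal)
for T in (1, 2, 4, 8):             # tensor-parallel degree
    t = (params * bytes_per_param / T) / hbm_bw   # each GPU reads 1/T of the weights
    print(f"T={T}: ~{t * 1e3:.1f} ms/token weight-read floor")
```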

Related terms: Pipeline Parallelism, Distributed Data Parallel, Fully Sharded Data Parallel, Transformer, Attention Mechanism
