NVLink is Nvidia's high-bandwidth point-to-point interconnect between GPUs (and, on Grace systems, between GPU and CPU). Each link aggregates several high-speed serial lanes, and each GPU exposes multiple links whose bandwidth adds up to the per-GPU aggregate.
Bandwidth by generation (per GPU, bidirectional aggregate):
- NVLink 1 (Pascal P100, 2016): 160 GB/s
- NVLink 2 (Volta V100): 300 GB/s
- NVLink 3 (Ampere A100): 600 GB/s
- NVLink 4 (Hopper H100): 900 GB/s across 18 links at 50 GB/s each
- NVLink 5 (Blackwell B200): 1.8 TB/s across 18 links at 100 GB/s each
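These aggregates are simply links × per-link bandwidth. A quick sketch of the arithmetic (the link counts for generations 1–3, four, six, and twelve, match the P100/V100/A100 configurations):

```python
# Aggregate NVLink bandwidth per GPU = number of links x per-link bandwidth.
# Per-link figures are bidirectional; link counts for generations 1-3 are the
# P100/V100/A100 configurations.
GENERATIONS = {
    "NVLink 1 (P100)": (4, 40),    #  4 x  40 GB/s =  160 GB/s
    "NVLink 2 (V100)": (6, 50),    #  6 x  50 GB/s =  300 GB/s
    "NVLink 3 (A100)": (12, 50),   # 12 x  50 GB/s =  600 GB/s
    "NVLink 4 (H100)": (18, 50),   # 18 x  50 GB/s =  900 GB/s
    "NVLink 5 (B200)": (18, 100),  # 18 x 100 GB/s = 1800 GB/s
}

for name, (links, per_link) in GENERATIONS.items():
    print(f"{name}: {links} links x {per_link} GB/s = {links * per_link} GB/s")
```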
By comparison, PCIe Gen5 x16 delivers 128 GB/s bidirectional; NVLink 5 is roughly 14× faster. This gap is what makes tensor parallelism (splitting a single matmul across GPUs) practical: the all-reduce of activations required on every forward pass would saturate PCIe but is comfortable on NVLink.
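A back-of-envelope timing makes the difference concrete. Using the standard ring all-reduce traffic model (each of p GPUs sends and receives about 2(p−1)/p of the tensor), with illustrative activation dimensions rather than any particular model's:

```python
# Lower-bound time for a ring all-reduce of a BF16 activation tensor.
# Each of p GPUs moves roughly 2*(p-1)/p * N bytes; divide by per-GPU
# bandwidth for an optimistic estimate (ignores latency and protocol overhead).

def allreduce_seconds(tensor_bytes: float, gpus: int, bw_gb_s: float) -> float:
    traffic = 2 * (gpus - 1) / gpus * tensor_bytes  # bytes moved per GPU
    return traffic / (bw_gb_s * 1e9)

# Illustrative activation: batch 8, sequence 8192, hidden 16384, BF16.
activation_bytes = 8 * 8192 * 16384 * 2  # ~2.1 GB

for name, bw in [("NVLink 5", 1800), ("NVLink 4", 900), ("PCIe Gen5 x16", 128)]:
    t_ms = allreduce_seconds(activation_bytes, gpus=8, bw_gb_s=bw) * 1e3
    print(f"{name}: {t_ms:.1f} ms")  # PCIe is ~14x slower per collective
```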
NVSwitch is the crossbar that turns point-to-point NVLink into all-to-all connectivity. Each third-generation NVSwitch chip has 64 NVLink 4 ports (the Blackwell-generation switch has 72 NVLink 5 ports). An HGX H100 8-GPU baseboard uses 4 NVSwitch chips so any GPU can talk to any other at the full 900 GB/s. The DGX H100 SuperPOD's NVLink Switch System extends this with external NVLink switches across 32 nodes (256 GPUs), though inter-node traffic is tapered 2:1, so cross-node NVLink bandwidth is roughly half the in-node 900 GB/s.
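A port-budget check shows why four switch chips suffice (a counting sketch only; the actual split between GPU-facing and external ports is more involved):

```python
# HGX H100 baseboard: 8 GPUs x 18 NVLink 4 links must land on 4 x 64-port
# NVSwitch chips. This checks the counts, not the physical wiring.
gpus, links_per_gpu = 8, 18
switches, ports_per_switch = 4, 64

gpu_facing = gpus * links_per_gpu     # 144 ports consumed by the GPUs
total = switches * ports_per_switch   # 256 ports available
print(f"GPU-facing ports: {gpu_facing} of {total}; {total - gpu_facing} left over")
```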
GB200 NVL72 is the current scale-up limit: a single rack containing 72 Blackwell GPUs and 36 Grace CPUs, all connected by NVLink 5 through 9 NVSwitch trays. To software, the 72 GPUs appear as one logical accelerator with 13.5 TB of unified HBM3e memory and 130 TB/s of aggregate NVLink fabric bandwidth. Inside the rack, any GPU can reach any other at the full 1.8 TB/s; only between racks does the slower InfiniBand or Ethernet take over.
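The headline NVL72 numbers hang together, as a quick check shows (the per-GPU HBM figure here is derived from the rack total, not a quoted spec):

```python
# Consistency check on the GB200 NVL72 figures quoted above.
gpus = 72
per_gpu_nvlink_tb_s = 1.8   # NVLink 5 aggregate per GPU
hbm_total_tb = 13.5         # quoted unified HBM3e capacity

print(f"Fabric bandwidth: {gpus * per_gpu_nvlink_tb_s:.1f} TB/s")   # ~129.6 -> "130 TB/s"
print(f"Derived HBM per GPU: {hbm_total_tb * 1000 / gpus:.0f} GB")  # ~188 GB
```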
Why this matters: large language models exceed single-GPU memory by more than an order of magnitude. A 1-trillion-parameter model in BF16 needs 2 TB for weights alone, against 80 GB on an H100. Distributed training therefore combines tensor parallelism (intra-node, NVLink-bound), pipeline parallelism (across layer stages), and data parallelism (gradient all-reduce, often InfiniBand-bound). The NVLink domain size, 8 GPUs on Hopper, 72 on Blackwell, sets the maximum practical tensor-parallel degree. Bigger NVLink domains allow wider tensor parallelism and shallower pipeline parallelism, which shrinks the bubble in pipeline schedules and improves end-to-end MFU.
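A minimal sketch of how the NVLink domain bounds a 3D-parallel layout, assuming 80 GB per GPU and counting weights only (real plans also budget optimizer state, gradients, and activations, and typically round the tensor-parallel degree to a power of two):

```python
# Sketch: weight-only shard count for a 1T-parameter BF16 model, and how the
# NVLink domain caps the tensor-parallel (TP) degree, pushing the remainder
# into pipeline parallelism (PP). Simplified: weights only, no activations.
from math import ceil

def plan(params: float, bytes_per_param: int, gpu_mem_gb: float, nvlink_domain: int):
    weight_gb = params * bytes_per_param / 1e9
    shards = ceil(weight_gb / gpu_mem_gb)  # minimum ways to split the weights
    tp = min(nvlink_domain, shards)        # keep TP inside the NVLink domain
    pp = ceil(shards / tp)                 # pipeline stages cover the rest
    return weight_gb, tp, pp

for domain in (8, 72):  # Hopper node vs GB200 NVL72 rack
    weight_gb, tp, pp = plan(1e12, 2, 80, domain)
    print(f"NVLink domain {domain:2d}: {weight_gb:.0f} GB weights -> TP={tp}, PP={pp}")
```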
NVLink is proprietary to Nvidia. AMD's closest equivalent is Infinity Fabric (~896 GB/s aggregate on MI300X), and an open consortium effort, UALink, aims to standardise this scale-up interconnect layer.
Related terms: InfiniBand and RoCE, Training-Cluster Economics, Power and Cooling
Discussed in:
- Chapter 15: Modern AI