Glossary

InfiniBand and RoCE

Once a training job exceeds a single NVLink domain (8 GPUs on Hopper, 72 on Blackwell), GPUs must talk over a cluster fabric. The two dominant options are InfiniBand and RDMA over Converged Ethernet (RoCE).

InfiniBand is a switched-fabric standard maintained by the InfiniBand Trade Association and produced almost exclusively by Nvidia (Mellanox). Speed grades follow the SDR/DDR/QDR/FDR/EDR/HDR/NDR/XDR ladder:

  • HDR (200 Gb/s): dominant 2019–2022.
  • NDR (400 Gb/s): shipping 2022 onward; standard on H100 clusters via the ConnectX-7 NIC and Quantum-2 switch.
  • XDR (800 Gb/s): shipping with Blackwell via ConnectX-8 and Quantum-X800, used on GB200 NVL72-based pods.
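
For quick arithmetic the ladder reduces to a lookup table. A minimal Python sketch, using the commonly quoted 4x-link data rates (`node_fabric_bandwidth` is a hypothetical helper, not a vendor API):

```python
# Commonly quoted 4x-link data rates per InfiniBand speed grade, in Gb/s.
# Usable goodput is a little lower after encoding and protocol overhead.
IB_SPEED_GRADES_GBPS = {
    "SDR": 8, "DDR": 16, "QDR": 32, "FDR": 56,
    "EDR": 100, "HDR": 200, "NDR": 400, "XDR": 800,
}

def node_fabric_bandwidth(grade: str, nics_per_node: int = 8) -> float:
    """Aggregate per-node fabric bandwidth in GB/s (divide Gb/s by 8)."""
    return IB_SPEED_GRADES_GBPS[grade] * nics_per_node / 8

print(node_fabric_bandwidth("NDR"))  # 400.0 GB/s for an 8-NIC H100 server
```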

RoCE v2 runs the same RDMA verbs over standard Ethernet (typically 400 GbE or 800 GbE), at the cost of careful PFC (priority flow control) and ECN (explicit congestion notification) tuning to avoid packet loss. Hyperscalers (Meta, Microsoft, AWS) have moved heavily to Ethernet-based fabrics to escape the Nvidia networking premium and exploit existing Ethernet supply chains; Meta, for example, trained Llama 3 on a custom 24k-GPU RoCE cluster.
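
The tuning problem is easier to see as a feedback loop: switches mark packets with ECN as queues grow, and the sender backs off before buffers overflow and PFC pauses kick in. A toy, DCQCN-flavoured sketch of that loop, with every constant invented for illustration (real RoCE congestion control runs in NIC firmware against tuned switch thresholds):

```python
# Toy model of ECN-driven rate control on a RoCE sender (DCQCN-flavoured).
# All constants are invented for illustration only.
LINE_RATE_GBPS = 400.0

def adjust_rate(rate: float, ecn_marked: bool) -> float:
    if ecn_marked:
        return rate * 0.75                    # multiplicative decrease on a mark
    return min(LINE_RATE_GBPS, rate + 5.0)    # additive recovery otherwise

rate = LINE_RATE_GBPS
for marked in (True, True, True, False, False, False):
    rate = adjust_rate(rate, marked)
    print(f"{rate:6.1f} Gb/s")  # drops fast under congestion, recovers slowly
```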

RDMA (Remote Direct Memory Access) is the key technology in both stacks. The NIC reads and writes a remote GPU's memory directly via GPUDirect RDMA, bypassing the CPU and the host kernel stack. Latency drops from tens of microseconds (TCP/IP) to the low single-digit microseconds, and the CPU cycles the kernel networking stack would have burned per connection shrink to nearly nothing.
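
Most training code never issues RDMA verbs directly; NCCL, reached here through torch.distributed, selects the GPUDirect RDMA path over InfiniBand or RoCE on its own when the hardware supports it. A minimal sketch, assuming a standard torchrun launch on a GPU node:

```python
# Minimal all-reduce over NCCL; on a cluster with InfiniBand/RoCE NICs and
# GPUDirect RDMA, NCCL moves these buffers NIC-to-GPU directly, bypassing
# host memory. Launch with: torchrun --nproc_per_node=8 this_script.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL picks the IB/RoCE transport
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

grad = torch.randn(1024, 1024, device="cuda")  # stand-in for a gradient buffer
dist.all_reduce(grad, op=dist.ReduceOp.SUM)    # the collective discussed above

dist.destroy_process_group()
```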

Topology matters enormously at scale:

  • Fat-tree (Clos): full bisection bandwidth, three switch layers (leaf → spine → super-spine). Standard for jobs up to ~10k GPUs (sized in the sketch after this list).
  • Dragonfly: hierarchical groups with all-to-all inside and sparser between, lower switch count, used at exascale HPC sites.
  • Rail-optimised: each GPU's NIC has a dedicated path through a separate switch plane, so the 8 GPUs in a node use 8 independent rails; minimises cross-rail congestion during all-reduce.
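
For scale, the classic three-tier fat-tree built from k-port switches supports k³/4 end-points at full bisection bandwidth using 5k²/4 switches. A quick sizing sketch (`fat_tree` is a hypothetical helper):

```python
# Sizing a classic three-tier fat-tree (Clos) built from k-port switches:
# k^3/4 end-points at full bisection bandwidth, 5k^2/4 switches in total.
def fat_tree(k: int) -> dict:
    assert k % 2 == 0, "fat-tree radix must be even"
    return {
        "end_points": k**3 // 4,            # host-facing ports (GPU NICs)
        "edge_switches": k * (k // 2),      # leaf layer
        "agg_switches": k * (k // 2),       # spine layer
        "core_switches": (k // 2) ** 2,     # super-spine layer
        "total_switches": 5 * k**2 // 4,
    }

# A 64-port radix (the NDR Quantum-2 switch) supports 65,536 end-points:
print(fat_tree(64))
```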

Bandwidth budget: a typical H100 server has 8 ConnectX-7 NICs at 400 Gb/s each, i.e. 3.2 Tb/s (400 GB/s) of fabric bandwidth per node, about half of one GPU's NVLink budget, but applied across the whole cluster. For a 16k-GPU job, a bandwidth-bound ring all-reduce of a 70 GB gradient (reduced over NVLink within each node first) pushes roughly 2 × (N−1)/N ≈ 2× the gradient size through each node's fabric links, so the fabric time is about 2 × 70 GB / 400 GB/s ≈ 350 ms per step; a log(N) factor appears only in latency-bound tree algorithms, not in the bandwidth term. A few hundred milliseconds becomes the floor of step time for very large models and motivates gradient compression, overlapped communication, and asynchronous schedules.
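
The same arithmetic as a runnable sketch; the hierarchical-ring model and all the figures (70 GB, 400 GB/s per node, 16k GPUs) are assumptions carried over from the paragraph above:

```python
# Back-of-envelope step-time floor for a hierarchical ring all-reduce:
# gradients are reduced over NVLink inside each node, then ring-reduced
# across nodes over the fabric. Bandwidth-bound cost ≈ 2*S*(N-1)/N / B.
def allreduce_floor_s(grad_gb: float, node_bw_gbs: float, num_nodes: int) -> float:
    traffic_gb = 2 * grad_gb * (num_nodes - 1) / num_nodes  # sent+received per node
    return traffic_gb / node_bw_gbs

nodes = 16_384 // 8                      # 16k GPUs, 8 per H100 server
t = allreduce_floor_s(grad_gb=70, node_bw_gbs=400, num_nodes=nodes)
print(f"{t * 1000:.0f} ms per step")     # ≈ 350 ms
```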

Related terms: NVLink and NVSwitch, Training-Cluster Economics, Frontier Lab Compute Consumption
