Glossary

GPU Memory Hierarchy

A modern AI accelerator is fundamentally a memory hierarchy wrapped around compute. Performance depends far more on which level of the hierarchy data lives in than on raw arithmetic throughput.

High-Bandwidth Memory (HBM) is the bulk store. A frontier GPU carries 80 GB (H100 SXM), 141 GB (H200) or 192 GB (B200) of HBM3/HBM3e stacked next to the die, delivering 3.35 TB/s (H100) through 8 TB/s (B200). HBM is large but slow relative to the compute units; a single H100 can issue ~990 TFLOP/s of dense FP16 tensor maths (~1979 TFLOP/s with 2:4 structured sparsity), which works out to roughly 300 FLOPs of compute per byte of HBM bandwidth.

L2 cache is shared across all streaming multiprocessors (SMs): 50 MB on H100 (AMD's MI300X pairs smaller per-chiplet L2 caches with a 256 MB Infinity Cache), with throughput in the tens of TB/s. Below it sits SRAM, the per-SM L1/"shared memory" block: 256 KB per SM on H100, of which up to 228 KB can be configured as shared memory; across 132 SMs that is ~30 MB on-chip, with aggregate bandwidth in the tens of TB/s. Finally, the register file offers 256 KB per SM at effectively zero-latency access.

Arithmetic intensity $I = \mathrm{FLOPs} / \mathrm{bytes}$ classifies a kernel against the roofline model: $$P = \min(P_{\mathrm{peak}},\; B \cdot I)$$ where $P$ is achieved throughput, $P_{\mathrm{peak}}$ peak compute, and $B$ memory bandwidth. A kernel below the ridge point $I^* = P_{\mathrm{peak}}/B$ is memory-bound; above it, compute-bound. For H100 dense FP16, $I^* \approx 295$ FLOPs/byte against HBM, dropping by roughly an order of magnitude when the working set sits in L2 or SRAM.
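
A few lines of Python make the classification mechanical. This is a minimal sketch using the H100 figures quoted in this entry; the per-kernel intensities are order-of-magnitude illustrations, not measurements:

```python
def roofline(intensity, peak_flops, mem_bandwidth):
    """Attainable throughput P = min(P_peak, B * I) for arithmetic intensity I (FLOPs/byte)."""
    return min(peak_flops, mem_bandwidth * intensity)

PEAK = 989.5e12   # H100 dense FP16 tensor throughput, FLOP/s
HBM  = 3.35e12    # H100 HBM3 bandwidth, bytes/s

ridge = PEAK / HBM                       # I*: where memory-bound becomes compute-bound
print(f"ridge point I* = {ridge:.0f} FLOPs/byte")

kernels = [
    ("FP16 elementwise add",         1 / 6),     # 1 FLOP per 6 bytes moved
    ("naive attention softmax pass", 1),         # order-1 intensity
    ("FlashAttention, d = 128",      128),       # intensity ~ head dimension
    ("4096^3 FP16 GEMM",             4096 / 3),  # ~ N/3 for a square GEMM
]
for name, intensity in kernels:
    p = roofline(intensity, PEAK, HBM)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    print(f"{name:30s}: {p / 1e12:7.1f} TFLOP/s attainable ({bound})")
```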

This is why FlashAttention matters. Naive attention materialises the $N \times N$ score matrix in HBM, giving $I$ of order 1 for the softmax and rescaling passes, hopelessly memory-bound. FlashAttention tiles $Q$, $K$, $V$ into blocks small enough to live in SRAM (typically $B_r = B_c = 64$; with head dimension $d = 64$ the three tiles occupy ~$3 \times 64 \times 64 \times 2 = 24$ KB in BF16), computes the softmax online, and never writes the full attention matrix back. Effective intensity climbs from ~1 to roughly $d$ (the head dimension, e.g. 128), pushing the kernel toward the compute-bound regime.
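
The tiling and online softmax are easier to see in plain NumPy than in the real CUDA kernel. The sketch below is a simplified single-head reference under the assumptions above; the function name, block sizes, and float64 check are illustrative, not the library's API:

```python
import numpy as np

def flash_attention_reference(Q, K, V, Br=64, Bc=64):
    """Tiled attention with an online softmax: the full N x N score matrix is
    never materialised; only Br x Bc score tiles exist at any moment."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    for i in range(0, N, Br):                     # loop over query blocks
        Qi = Q[i:i + Br]
        m = np.full(Qi.shape[0], -np.inf)         # running row maxima
        l = np.zeros(Qi.shape[0])                 # running softmax denominators
        acc = np.zeros((Qi.shape[0], d))          # unnormalised output accumulator

        for j in range(0, N, Bc):                 # stream key/value blocks
            Kj, Vj = K[j:j + Bc], V[j:j + Bc]
            S = (Qi @ Kj.T) * scale               # Br x Bc score tile ("in SRAM")

            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])        # tile softmax numerator
            alpha = np.exp(m - m_new)             # rescale factor for old statistics
            l = alpha * l + P.sum(axis=1)
            acc = alpha[:, None] * acc + P @ Vj
            m = m_new

        O[i:i + Br] = acc / l[:, None]            # final per-row normalisation
    return O

# Sanity check against the naive formulation on random data.
rng = np.random.default_rng(0)
N, d = 256, 128
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
S = Q @ K.T / np.sqrt(d)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_reference(Q, K, V), ref)
```

The real kernel does the same bookkeeping per thread block, with the $Q$, $K$, $V$ tiles held in shared memory and the running statistics and accumulator in registers.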

Practical implications: matrix multiplications with one small dimension stay memory-bound regardless of tensor-core throughput; batch-1 inference of an LLM moves $\sim 2N_{\mathrm{params}}$ bytes per token from HBM, so a 70 B-parameter model in BF16 needs 140 GB transferred per token, capping single-GPU throughput at $B / 140\,\mathrm{GB} \approx 24$ tokens/s at H100 bandwidth even with infinite compute. Every serious AI kernel (attention, MoE routing, KV-cache reuse) is designed around this hierarchy.
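
The same arithmetic gives the bandwidth ceiling on single-stream decode throughput. A rough sketch only: the helper name is made up, the model sizes and dtypes are just examples, and KV-cache and activation traffic are ignored:

```python
def max_decode_tokens_per_s(n_params, bytes_per_param, hbm_bandwidth):
    """Upper bound on batch-1 decode speed: every weight is read once per token."""
    bytes_per_token = n_params * bytes_per_param
    return hbm_bandwidth / bytes_per_token

HBM_H100 = 3.35e12  # bytes/s

examples = [
    ("70B BF16", 70e9, 2.0),
    ("70B INT4", 70e9, 0.5),
    ("8B BF16",   8e9, 2.0),
]
for name, n_params, bytes_per_param in examples:
    cap = max_decode_tokens_per_s(n_params, bytes_per_param, HBM_H100)
    print(f"{name}: <= {cap:.0f} tokens/s")
# 70B BF16 -> ~24 tokens/s, the figure above; the bound is why quantisation and
# batching matter more than peak TFLOPs for single-stream decoding.
```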

Related terms: Tensor Cores, FlashAttention Internals, FlashAttention, KV Cache
