A tensor core is a hardware unit inside an Nvidia streaming multiprocessor (SM) that performs a fused matrix multiply-accumulate of the form $D = A \cdot B + C$ on small dense tiles in a single instruction. Whereas a CUDA core executes one scalar fused multiply-add per cycle, a tensor core executes hundreds of multiply-adds per cycle by exploiting tile-level parallelism.
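For reference, one tensor-core MMA over an $m \times n \times k$ tile computes the same result as the scalar loop below (a minimal sketch for illustration; row-major layouts assumed). The hardware simply retires the inner multiply-adds in parallel rather than one per cycle:

```cuda
// Scalar reference for one MMA instruction over an m x n x k tile:
// D = A * B + C, where A is m x k, B is k x n, C and D are m x n.
// Row-major layouts assumed for illustration.
void mma_reference(int m, int n, int k,
                   const float* A, const float* B,
                   const float* C, float* D) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = C[i * n + j];
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];  // one fused multiply-add
            D[i * n + j] = acc;
        }
}
```

A $4 \times 4 \times 4$ tile is 64 such multiply-adds; a Volta tensor core retires all of them per cycle.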
Volta (V100, 2017) introduced tensor cores operating on $4 \times 4 \times 4$ FP16 tiles, accumulating in FP32. Eight tensor cores per SM across 80 SMs gave 125 TFLOP/s of FP16, a step change from the Pascal P100's roughly 21 TFLOP/s.
Turing (T4, RTX 20-series, 2018) added INT8 and INT4 modes for inference, doubling and quadrupling FP16 throughput respectively at the cost of dynamic range.
Ampere (A100, 2020) brought BF16 (8-bit exponent, 7-bit mantissa: same range as FP32, lower precision than FP16) and TF32 (a 19-bit format with FP32's 8-bit exponent and a 10-bit mantissa that drops into FP32 codepaths transparently). The tile shape grew to $16 \times 8 \times 16$; peak throughput reached 312 TFLOP/s BF16 at 400 W.
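The range-preservation claim is easy to verify: BF16 is just the top 16 bits of an FP32 value, so conversion is a rounded truncation. A self-contained sketch (CUDA's `__nv_bfloat16` provides this natively; NaN handling omitted for brevity):

```cuda
#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert FP32 to BF16 by dropping the low 16 mantissa bits with
// round-to-nearest-even. BF16 keeps FP32's 8-bit exponent, so the
// representable range is unchanged; only the mantissa shrinks from
// 23 bits to 7.
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);  // round to nearest even
    return (uint16_t)((bits + rounding) >> 16);
}

static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;  // zero-fill the lost mantissa bits
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.14159265f;
    printf("%.8f -> bf16 -> %.8f\n", x, bf16_to_f32(f32_to_bf16(x)));
    // Range survives: 1e38 is representable in BF16 but overflows FP16.
    printf("1e38 round-trips to %g\n", bf16_to_f32(f32_to_bf16(1e38f)));
    return 0;
}
```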
Hopper (H100, 2022) added the Transformer Engine with FP8 in two flavours: E4M3 (4-bit exponent, 3-bit mantissa, used for forward activations and weights) and E5M2 (5-bit exponent, 2-bit mantissa, wider range for gradients). Per-tensor scaling factors are tracked by the Transformer Engine (a software library working with the FP8 tensor cores) and updated each step from running amax statistics. Peak: 989 TFLOP/s BF16, 1979 TFLOP/s FP8 (dense). The tile shape grows to $64 \times 256 \times 16$ via the new wgmma instruction, which acts on warpgroups of 128 threads.
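A minimal one-shot sketch of the per-tensor scaling idea (the real Transformer Engine applies delayed scaling from a history of amax values; the function name here is illustrative, not its API). E4M3's largest finite value is 448, so the scale maps the tensor's absolute maximum onto that bound:

```cuda
#include <cmath>

// Illustrative per-tensor scale for FP8 E4M3 (max finite value 448).
// Inputs are multiplied by the scale before the cast to FP8, and the
// matmul output is divided by scale_A * scale_B afterwards.
float e4m3_scale(const float* t, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i)
        amax = std::fmax(amax, std::fabs(t[i]));
    return amax > 0.0f ? 448.0f / amax : 1.0f;
}
```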
Blackwell (B200, 2024) introduces FP4 (E2M1: 2-bit exponent, 1-bit mantissa) and second-generation structured sparsity (2:4, two non-zeros in every block of four). Peak FP4: 20 PFLOP/s per B200 (sparse), 10 PFLOP/s dense. The GB200 NVL72 rack packages 72 B200s with NVLink switches into a single coherent domain at 1.4 EFLOP/s FP4.
Why the precision ladder works: training stability requires the dynamic range of BF16 or higher in critical paths (gradients, master weights), but most matmuls are robust to FP8 with per-channel scaling, and inference is robust to FP4 with weight calibration. Each halving of bit-width roughly doubles peak FLOP/s and halves HBM bandwidth pressure.
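Per-channel scaling is the per-tensor idea at finer granularity: one scale per output channel of the weight matrix, so a single outlier channel no longer compresses the usable range of every other channel. A hypothetical sketch, following the same convention as above:

```cuda
#include <cmath>
#include <vector>

// One scale per output channel (column) of an r x c weight matrix,
// each mapping that channel's amax onto the format's largest finite
// value (e.g. 448 for E4M3). Illustrative only.
std::vector<float> per_channel_scales(const float* W, int r, int c,
                                      float fmt_max) {
    std::vector<float> s(c, 1.0f);
    for (int j = 0; j < c; ++j) {
        float amax = 0.0f;
        for (int i = 0; i < r; ++i)
            amax = std::fmax(amax, std::fabs(W[i * c + j]));
        if (amax > 0.0f) s[j] = fmt_max / amax;
    }
    return s;
}
```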
Programmer-visible API: at the CUDA level, tensor cores are accessed via the wmma, mma.sync, wgmma, and tcgen05 PTX instructions or, far more commonly, via cuBLAS, cuDNN, CUTLASS templates, and frameworks (PyTorch, JAX) that call them. Kernels that bypass tensor cores (e.g. naive elementwise loops) leave 90%+ of the GPU's peak FLOP/s unused.
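A minimal wmma kernel, roughly the simplest way to touch a tensor core directly (Volta or later; compile for sm_70+; this is one of the supported $16 \times 16 \times 16$ fragment configurations):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile D = A*B + C with FP16
// inputs and FP32 accumulation. Launch as wmma_tile<<<1, 32>>>(...)
// with device pointers to 16x16 matrices.
__global__ void wmma_tile(const half* A, const half* B,
                          const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);
    wmma::mma_sync(acc, a, b, acc);     // the tensor-core MMA
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```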
Related terms: GPU Memory Hierarchy, Mixed Precision Training, Quantisation for Inference, FlashAttention Internals
Discussed in:
- Chapter 15: Modern AI