The Tensor Processing Unit (TPU) is Google's family of custom ASICs for deep learning. Its defining feature is the systolic array, a two-dimensional grid of identical multiply-accumulate processing elements (PEs) through which operands flow in lock-step, a design first proposed by H. T. Kung and Charles Leiserson in 1978.
How it works: in an $N \times N$ systolic array computing $C = A \cdot B$, matrix $A$ is streamed in from the left and $B$ from the top. Each cycle, every PE multiplies the pair of operands currently in its registers, adds the product to its local accumulator, and shifts the operands one cell onward: $A$ values move right, $B$ values move down. Once the pipeline is full, the array sustains up to $N^2$ multiply-adds per cycle. Crucially, each operand is loaded from off-chip memory once and reused $N$ times as it traverses the array, so arithmetic intensity scales with the array size.
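To make the data movement concrete, here is a minimal cycle-level simulation of this scheme in plain NumPy. It is an illustrative sketch of the classic output-stationary design described above, not TPU microcode (the real MXU is weight-stationary: weights are pre-loaded and activations stream through). Row $i$ of $A$ and column $j$ of $B$ are skewed by $i$ and $j$ cycles so that matching operands meet in PE $(i, j)$:

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level simulation of an N x N output-stationary systolic array.

    A enters from the left and moves right; B enters from the top and
    moves down. Each PE keeps one accumulator, so after the pipeline
    drains, the accumulator grid holds C = A @ B.
    """
    N = A.shape[0]
    acc = np.zeros((N, N))      # one accumulator per PE
    a_reg = np.zeros((N, N))    # the A operand currently held in each PE
    b_reg = np.zeros((N, N))    # the B operand currently held in each PE

    for t in range(3 * N - 2):  # 3N-2 cycles until the last PE finishes
        # Shift operands one cell: A moves right, B moves down.
        # (np.roll's wrapped-around values land in the boundary
        # column/row, which is fully overwritten just below.)
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Inject skewed boundary values: row i of A (column j of B)
        # lags by i (j) cycles so matching operands meet in PE (i, j).
        for i in range(N):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < N else 0.0
        for j in range(N):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < N else 0.0
        # Every PE multiplies its current pair and accumulates locally.
        acc += a_reg * b_reg
    return acc

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 8, 8))
assert np.allclose(systolic_matmul(A, B), A @ B)
```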
Generations:
- TPUv1 (2015): inference only, INT8, $256 \times 256$ MXU, 92 TOPS (see the arithmetic after this list), 28 nm.
- TPUv2 (2017): training, BF16, 45 TFLOP/s per chip.
- TPUv3 (2018): 123 TFLOP/s, liquid-cooled.
- TPUv4 (2021): 275 TFLOP/s BF16, optical circuit switches, 4096-chip pods.
- TPUv5e (2023): cost-optimised inference, 197 TFLOP/s BF16, INT8.
- TPUv5p (2024): training flagship, 459 TFLOP/s BF16, 8960 chips per pod.
- Trillium / TPUv6e (2024): 4.7× compute over v5e, 32 GB HBM, 1.8 TB/s.
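The headline throughput numbers follow directly from the array size and the clock. For TPUv1, whose ISCA 2017 paper gives a 700 MHz clock:

$$256 \times 256 \ \text{MACs} \times 2 \ \tfrac{\text{ops}}{\text{MAC}} \times 0.7 \ \text{GHz} \approx 91.8 \ \text{TOPS} \approx 92 \ \text{TOPS (INT8)}.$$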
Pod-scale interconnect: TPU pods link chips in a torus: a 2D torus through v3, and from v4 onward a 3D torus whose links pass through optical circuit switches that reconfigure the topology per job. Inter-chip interconnect bandwidth is on the order of 4.8 Tb/s (600 GB/s) aggregate per chip for v5p. The "single device" abstraction in JAX, where jax.devices() enumerates thousands of TPU cores and a single jit-compiled program runs across all of them, is enabled by this fabric and by XLA's SPMD partitioning.
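A minimal sketch of how that abstraction looks from user code, assuming any set of JAX devices (a TPU pod slice, or a few CPU/GPU devices locally); the array shape here is illustrative and must be divisible by the device count:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange all visible devices (TPU cores on a pod slice, or local
# CPU/GPU devices) into a 1D logical mesh.
mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

# Shard a matrix row-wise across the mesh. From here on it behaves
# like one array on one logical device.
x = jax.device_put(jnp.ones((4096, 4096)),
                   NamedSharding(mesh, P("x", None)))

# One jit-compiled program runs on every device; XLA's SPMD
# partitioner inserts the inter-chip (torus/ICI) communication
# needed for the contraction automatically.
@jax.jit
def gram(a):
    return a @ a.T

y = gram(x)
print(y.sharding)  # the result is itself sharded across the mesh
```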
Trade-offs vs GPUs: the systolic-array design wins on dense matmul efficiency and energy per FLOP for shapes that fit the array, and pods achieve very high all-reduce bandwidth via the torus. It loses on flexibility: irregular sparsity, dynamic control flow, and mixed-shape kernels are easier to handle on GPUs. TPUs are also available only via Google Cloud, with no on-premise option, which constrains adoption outside Google. Most of Google's frontier models (Gemini, PaLM, Gemma) are trained on TPU pods; almost everyone else trains on GPUs.
Related terms: Tensor Cores, AI Accelerator Landscape, InfiniBand and RoCE
Discussed in:
- Chapter 15: Modern AI