Glossary

TPU Systolic Array

The Tensor Processing Unit (TPU) is Google's family of custom ASICs for deep learning. Its defining feature is the systolic array: a two-dimensional grid of identical multiply-accumulate processing elements (PEs) through which operands flow in lock-step, a design first proposed by H. T. Kung and Charles Leiserson in 1978.

How it works: in an $N \times N$ systolic array computing $C = A \cdot B$, the matrix $A$ is streamed in from the left and $B$ from the top. On each cycle, every PE multiplies the pair of operands currently in its registers, adds the product to its accumulator, and shifts its $A$ operand one cell to the right and its $B$ operand one cell down. Once the pipeline fills, the array sustains $N^2$ multiply-adds per cycle. Crucially, each operand is loaded from off-chip memory once and reused $N$ times as it traverses the array, so arithmetic intensity scales with the array dimension.
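The dataflow above can be sketched as a cycle-by-cycle simulation in plain Python. This is an illustrative, output-stationary model (the function name and register layout are mine, not TPU internals): inputs enter skewed so that matching $A$ and $B$ operands meet in the right PE.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary N x N systolic array computing C = A.B.

    A streams in from the left (row i skewed by i cycles), B from the top
    (column j skewed by j cycles); each PE holds a partial sum of C[i][j].
    A sketch of the dataflow only -- real MXUs pipeline this in hardware.
    """
    N = len(A)
    C = [[0.0] * N for _ in range(N)]
    a_reg = [[0.0] * N for _ in range(N)]   # A operand inside each PE
    b_reg = [[0.0] * N for _ in range(N)]   # B operand inside each PE
    for t in range(3 * N - 2):              # cycles until the array drains
        # Shift A one cell right and B one cell down (back to front,
        # so a value is not overwritten before it moves on).
        for i in range(N):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(N):
            for i in range(N - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the left and top edges (zeros pad the skew).
        for i in range(N):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < N else 0.0
        for j in range(N):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < N else 0.0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(N):
            for j in range(N):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

Skewing row $i$ by $i$ cycles and column $j$ by $j$ cycles means PE $(i, j)$ sees $A[i][k]$ and $B[k][j]$ together at cycle $t = i + j + k$, so after $3N - 2$ cycles every partial sum is complete.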

Generations:

  • TPUv1 (2015): inference only, INT8, $256 \times 256$ MXU, 92 TOPS, 28 nm.
  • TPUv2 (2017): training, BF16, 45 TFLOP/s per chip.
  • TPUv3 (2018): 123 TFLOP/s, liquid-cooled.
  • TPUv4 (2021): 275 TFLOP/s BF16, optical circuit switches, 4096-chip pods.
  • TPUv5e (2023): cost-optimised inference, 197 TFLOP/s BF16, INT8.
  • TPUv5p (2024): training flagship, 459 TFLOP/s BF16, 8960 chips per pod.
  • Trillium / TPUv6e (2024): 4.7× compute over v5e, 32 GB HBM, 1.8 TB/s.

Pod-scale interconnect: TPU pods use a torus topology: 2D through v3, and 3D from v4 onward, with optical circuit switches that reconfigure the topology per job. Inter-chip interconnect bandwidth is on the order of 4.8 Tb/s per chip for v5p. The "single device" abstraction in JAX (jax.devices() can return thousands of TPU devices on a pod slice) is enabled by this fabric and by XLA's SPMD partitioning.
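A minimal JAX sketch of that abstraction, runnable on an ordinary CPU machine where jax.devices() reports a single device; on a pod slice the same code would shard across every chip (the mesh axis name "data" is an arbitrary choice for this sketch):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# On a TPU pod slice, jax.devices() lists every chip in the job; on a
# CPU-only machine it lists one device and the same code still runs.
devices = np.array(jax.devices())
n = len(devices)

# A 1-D device mesh with a single named axis.
mesh = Mesh(devices, axis_names=("data",))

# Shard the leading dimension of x across the "data" axis of the mesh.
x = jnp.ones((8 * n, 4), dtype=jnp.float32)
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))

# XLA partitions the jitted computation across all devices (SPMD); no
# per-device code is written by hand.
z = jax.jit(lambda a: a * 2)(x)
```

The point of the sketch is that the program is written as if for one device; the sharding annotation plus XLA's partitioner decide how it maps onto the fabric.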

Trade-offs vs GPUs: the systolic-array design wins on dense-matmul efficiency and energy per FLOP for shapes that fit the array, and pods achieve very high all-reduce bandwidth via the torus. It loses on flexibility: irregular sparsity, dynamic control flow, and mixed-shape kernels are easier on GPUs. TPUs are also only available via Google Cloud, with no on-premise option, which constrains adoption outside Google. Most of Google's frontier models (Gemini, PaLM, Gemma) are trained on TPU pods; almost everyone else trains on GPUs.

Related terms: Tensor Cores, AI Accelerator Landscape, InfiniBand and RoCE

This site is currently in Beta. Contact: Chris Paton
