Frontier model training has scaled by roughly 10× compute per generation since 2018, a trajectory driven by scaling laws (Kaplan, Hoffmann/Chinchilla) showing that loss falls predictably as a power law in compute.
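For reference, the Chinchilla fit (Hoffmann et al., 2022) models loss as a sum of power laws in parameter count $N$ and training tokens $D$; the constants below are the paper's reported values, quoted approximately:

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\ \alpha \approx 0.34,\ \beta \approx 0.28$$

Minimising this under a fixed compute budget $C \approx 6ND$ gives the familiar compute-optimal prescription of roughly 20 training tokens per parameter.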
Compute milestones (training FLOPs, public estimates from Epoch AI):
- GPT-2 (2019): $1.5 \times 10^{21}$
- GPT-3 (2020): $3.1 \times 10^{23}$
- PaLM (2022): $2.5 \times 10^{24}$
- GPT-4 (2023): ~$2 \times 10^{25}$
- Gemini Ultra (2023): ~$5 \times 10^{25}$
- Llama 3.1 405B (2024): $3.8 \times 10^{25}$
- GPT-4.5 / Grok 3 / Claude 3.5 Opus class (2024–25): estimated $10^{26}$
- Next generation (2025–26): aiming at $10^{26}$–$10^{27}$
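Most of these estimates follow (or are consistent with) the standard dense-transformer approximation $C \approx 6ND$, with $N$ parameters and $D$ training tokens. Llama 3.1 405B, whose figures are public (405B parameters, ~15.6T tokens), is a convenient check:

$$C \approx 6 \times (4.05 \times 10^{11}) \times (1.56 \times 10^{13}) \approx 3.8 \times 10^{25}\ \text{FLOPs}$$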
A single H100 delivers $\sim 10^{15}$ dense BF16 FLOP/s at peak; $10^{26}$ FLOPs therefore needs $10^{11}$ GPU-seconds, or about 28 million H100-hours at full utilisation. At 30 % MFU, roughly 90 million wall-clock H100-hours.
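A minimal sketch of that back-of-envelope arithmetic in Python; the FLOP target, per-GPU peak rate, MFU, and cluster size are illustrative assumptions rather than measured values:

```python
# Back-of-envelope: GPU-hours and wall-clock time for a frontier training run.
# All inputs are round-number assumptions for illustration, not measurements.

TARGET_FLOPS = 1e26          # total training compute budget
PEAK_FLOPS_PER_GPU = 1e15    # ~1 PFLOP/s dense BF16 per H100 at peak
MFU = 0.30                   # model FLOPs utilisation actually achieved
CLUSTER_GPUS = 100_000       # GPUs available for the run

SECONDS_PER_HOUR = 3600

# Ideal GPU-seconds at peak throughput, then adjust for utilisation.
ideal_gpu_seconds = TARGET_FLOPS / PEAK_FLOPS_PER_GPU
wallclock_gpu_hours = ideal_gpu_seconds / MFU / SECONDS_PER_HOUR

# Spread across the cluster to get calendar time.
days = wallclock_gpu_hours / CLUSTER_GPUS / 24

print(f"GPU-hours at 100% MFU: {ideal_gpu_seconds / SECONDS_PER_HOUR:,.0f}")
print(f"Wall-clock GPU-hours at {MFU:.0%} MFU: {wallclock_gpu_hours:,.0f}")
print(f"Days on a {CLUSTER_GPUS:,}-GPU cluster: {days:.0f}")
```

On these assumptions the run occupies a 100,000-GPU cluster for roughly 5–6 weeks, in line with the multi-week schedules reported for frontier runs.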
Cluster sizes:
- Llama 3.1 405B was trained on up to 16,384 H100s; Meta's Llama 3 infrastructure comprised two co-located 24,576-GPU clusters (one on InfiniBand, one on RoCE Ethernet).
- xAI Colossus (2024): 100,000 H100s in a single data centre in Memphis, since expanded to 200,000 GPUs with an H100/H200 mix and a stated target of 1 million.
- OpenAI Stargate (announced January 2025 with SoftBank, Oracle, and MGX; Microsoft as technology partner): multi-gigawatt sites, $100B+ capex (up to $500B planned), targeted to be operational by 2028.
- Anthropic Project Rainier (2025): ~400k Trainium2 chips on AWS for Claude training.
Power: an H100 server (8 GPUs plus CPUs, NICs, and cooling) draws 10–14 kW, so 100,000 H100s draw ~150 MW of IT power, or roughly 200 MW of facility load once PUE is included. The next generation (B200 at ~1 kW per GPU; GB200 NVL72 at ~120 kW per rack) pushes the largest training sites toward multiple gigawatts. For comparison, a typical nuclear reactor produces ~1 GW.
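The power figures follow from the same kind of arithmetic; a small sketch, treating server draw, PUE, and cluster size as assumed round numbers:

```python
# Rough IT and facility power for a large H100 cluster (assumed round numbers).
GPUS = 100_000
GPUS_PER_SERVER = 8
KW_PER_SERVER = 12.0   # midpoint of the 10-14 kW range for an 8-GPU H100 server
PUE = 1.3              # assumed power usage effectiveness of the facility

servers = GPUS / GPUS_PER_SERVER
it_power_mw = servers * KW_PER_SERVER / 1000
facility_power_mw = it_power_mw * PUE

print(f"IT power: {it_power_mw:.0f} MW, facility load: {facility_power_mw:.0f} MW")
# -> IT power: 150 MW, facility load: 195 MW
```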
Order-of-magnitude per generation: this rate is consistent with both Moore-style cost-performance gains (~3× per process node) and algorithmic efficiency gains (~3× per year, per Epoch AI). A doubling of effective compute every ~6–10 months (see the sketch after this list) cannot continue indefinitely without at least one of:
- New power infrastructure at the 100+ GW scale (gas turbines, nuclear PPAs; both Microsoft and Amazon signed nuclear deals in 2024).
- Algorithmic breakthroughs reducing FLOP demand per capability unit.
- Hardware breakthroughs beyond CMOS (silicon photonics, optical compute).
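As a sketch of how the quoted doubling time falls out of an annual growth factor in effective compute (the factors below are illustrative round numbers bracketing the ~3×/year figure cited above):

```python
# Doubling time implied by an assumed annual growth factor in effective compute.
import math

def doubling_time_months(annual_factor: float) -> float:
    """Months for effective compute to double, given growth of `annual_factor` per year."""
    return 12 * math.log(2) / math.log(annual_factor)

# Illustrative annual factors: ~2.3x, 3x (the Epoch AI efficiency figure), 4x.
for factor in (2.3, 3.0, 4.0):
    print(f"{factor:.1f}x per year -> doubling every {doubling_time_months(factor):.1f} months")
# 2.3x -> ~10.0 months, 3.0x -> ~7.6 months, 4.0x -> ~6.0 months
```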
Why it matters for the field: the gap between frontier-lab compute (~$10^{26}$ FLOPs) and academic compute (typically $10^{20}$–$10^{22}$) has grown to 4–6 orders of magnitude, locking academia out of frontier pre-training and concentrating capability research in a handful of labs (OpenAI, Anthropic, Google DeepMind, Meta, xAI, plus Chinese counterparts).
Related terms: Training-Cluster Economics, Power and Cooling, InfiniBand and RoCE, Inference Cost Economics
Discussed in:
- Chapter 15: Modern AI