Glossary

Training-Cluster Economics

The headline figure for any frontier training run is dollars per FLOP delivered to the model. This is shaped by hardware acquisition cost, MFU (Model FLOPs Utilisation), electricity and cooling, networking, and amortisation.

Hardware cost: an H100 SXM module retails at ~\$30k; an HGX H100 8-GPU server with NVLink, NICs, NVMe and chassis runs ~\$300–400k; a 1024-GPU cluster (the smallest "real" training scale) is ~\$50–80M including networking and storage.

MFU = (useful model FLOPs executed) / (peak FLOP/s × wall-clock seconds) — i.e. achieved throughput as a fraction of hardware peak. Reported numbers:

  • GPT-3 (2020): ~21 % MFU on V100s.
  • PaLM (2022): 46 % on TPUv4.
  • Llama 2 (2023): 38 % on A100s.
  • Llama 3 (2024): 38–43 % on H100s (Meta paper).
  • Frontier H100 runs (2024–25): typically 30–50 %; anything above 50 % is exceptional.
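The definition above can be turned into a one-line estimator. The run parameters below (a 2 × 10²⁵-FLOP run on 8,192 GPUs over 90 days) and the H100 dense BF16 peak of ~989 TFLOP/s are illustrative assumptions, not figures from any of the papers listed:

```python
def mfu(model_flops: float, n_gpus: int, wall_seconds: float,
        peak_flops_per_gpu: float = 989e12) -> float:
    """Model FLOPs Utilisation: useful FLOPs over peak FLOPs available.

    peak_flops_per_gpu defaults to an H100 SXM's dense BF16 peak
    (~989 TFLOP/s); swap in 312e12 for an A100.
    """
    return model_flops / (n_gpus * wall_seconds * peak_flops_per_gpu)

# Illustrative: a 2e25-FLOP run on 8,192 H100s taking 90 days
seconds = 90 * 24 * 3600
print(f"MFU ≈ {mfu(2e25, 8192, seconds):.0%}")  # → MFU ≈ 32%
```

Note that `model_flops` counts only the model's forward/backward FLOPs, so communication overhead, pipeline bubbles and recomputation all show up as a lower MFU rather than as extra FLOPs.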

The shortfall comes from communication overhead (all-reduce, all-gather), pipeline bubbles, idle SMs during memory-bound steps, and non-matmul work (layernorm, softmax, embedding lookups).

GPT-4 compute cost (estimated by Epoch AI): $2 \times 10^{25}$ FLOPs at ~\$2/H100-hour and ~32 % MFU implies **~\$63M of pure compute**. Total project cost, including failed runs, data, salaries, and multiple training rounds, is more typically reported as \$100M+.
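The arithmetic behind estimates like this can be sketched in a few lines. The inputs below (an A100-class BF16 peak of 312 TFLOP/s and \$1/GPU-hour) are illustrative assumptions, not Epoch AI's exact parameters; the headline figure is sensitive to the assumed peak throughput and hourly price:

```python
def compute_cost(model_flops: float, mfu: float,
                 peak_flops_per_gpu: float, usd_per_gpu_hour: float) -> float:
    """Dollar cost of the GPU-hours needed to deliver model_flops."""
    effective = mfu * peak_flops_per_gpu        # FLOP/s actually delivered per GPU
    gpu_hours = model_flops / effective / 3600  # total GPU-hours required
    return gpu_hours * usd_per_gpu_hour

# Illustrative: 2e25 FLOPs at 32 % MFU on a 312 TFLOP/s part at $1/GPU-hour
print(f"${compute_cost(2e25, 0.32, 312e12, 1.0) / 1e6:.0f}M")  # → $56M
```

Doubling the assumed hourly price or halving the assumed MFU each doubles the result, which is why published estimates for the same run can differ by tens of millions of dollars.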

DeepSeek-V3 (Dec 2024) reported **\$5.6M of compute** for the final 14.8 T-token run on a 2048-GPU H800 cluster: $2.79 \times 10^6$ H800-hours × \$2/h. The figure excludes prior R&D, infrastructure and synthetic data generation, and depends on internal pricing.

Operating cost dominates over the 3–4 year hardware lifetime:

  • Power: an H100 draws 700 W at peak; a server with 8 GPUs plus CPUs, NICs, fans draws ~10 kW. At a PUE of 1.3 (modern data centre) and **\$0.08/kWh** (industrial US rate), one server costs ~\$9k/year in electricity, ~\$36k over four years, roughly 3 % of the purchase price annually.
  • Cooling and facility: roughly 30 % uplift over IT power for traditional air cooling, 15 % for liquid; capex on the building is amortised at ~\$10–15M per MW.
  • Networking: InfiniBand NICs and switches add 10–15 % of cluster cost, and a fabric typically needs rebuilding to a new generation (HDR → NDR → XDR) every 3–4 years.
  • Depreciation: hyperscalers depreciate AI hardware over 5–6 years (recently extended); operators charge \$2–4 per GPU-hour to recover this plus operating cost.
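The electricity figure in the first bullet can be checked directly, and the bullets combine into a rough per-GPU-hour floor. The server price, lifetime and full-utilisation assumption in the second half are illustrative, not figures from the text:

```python
HOURS_PER_YEAR = 8760

def annual_power_cost(it_kw: float, pue: float = 1.3,
                      usd_per_kwh: float = 0.08) -> float:
    """Yearly electricity bill for a server drawing it_kw of IT load."""
    return it_kw * pue * HOURS_PER_YEAR * usd_per_kwh

server_kw = 10.0                       # 8 GPUs plus CPUs, NICs, fans
power = annual_power_cost(server_kw)   # ~ $9.1k/year, ~ $36k over 4 years
print(f"electricity: ${power:,.0f}/year")

# Illustrative $/GPU-hour to recover capex + power over 5 years
# (assumed: $350k server, 8 GPUs, 100 % utilisation)
capex, gpus, years = 350_000, 8, 5
per_gpu_hour = (capex + years * power) / (gpus * years * HOURS_PER_YEAR)
print(f"recovery floor: ${per_gpu_hour:.2f}/GPU-hour")
```

The gap between this ~\$1/GPU-hour floor and the \$2–4 operators actually charge covers facility capex, cooling, networking, staff, margin, and sub-100 % utilisation.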

Trend: \$/effective-FLOP for training has fallen ~3× per generation (P100 → V100 → A100 → H100), driven by die shrinks, lower-precision tensor cores, and HBM scale-out. Three generational steps at ~3× each compound to roughly 27×, so the same dollar buys ~30× more BF16 FLOPs in 2024 than in 2018.

Related terms: Frontier Lab Compute Consumption, Inference Cost Economics, Power and Cooling, NVLink and NVSwitch
This site is currently in Beta. Contact: Chris Paton