The AI accelerator market in 2025 is dominated by Nvidia, with credible-but-distant challengers in three categories: (1) competing GPUs (AMD), (2) hyperscaler in-house ASICs (Google TPU, AWS Trainium, Microsoft Maia, Meta MTIA), and (3) novel architectures (Cerebras, Groq, Tenstorrent, Graphcore).
Nvidia (~90 % market share, 2024–25):
- H100 / H200 (Hopper, 2022–24): 989 TFLOP/s BF16, 1979 TFLOP/s FP8.
- B200 / GB200 NVL72 (Blackwell, 2024–25): 2.25 PFLOP/s BF16 dense, 9 PFLOP/s FP4 dense, 1.8 TB/s NVLink 5 (see the sketch after this list).
- Software moat: CUDA (15+ years), cuDNN, cuBLAS, NCCL, Triton, TensorRT-LLM, Megatron, NeMo. Every major framework and serving system is CUDA-first.
- Revenue: data centre segment ~$115B/year (FY2025), 78 % gross margin.
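To put the peak numbers in context, here is a minimal back-of-envelope sketch converting a training budget into wall-clock days. Only the per-chip BF16 peaks come from the bullets above; the $10^{25}$-FLOP budget, 10,000-chip cluster, and 40 % model-FLOP utilisation are illustrative assumptions, not vendor figures.

```python
# Back-of-envelope training-time estimate from peak BF16 throughput.
# Per-chip peaks come from the bullets above; the FLOP budget, cluster size
# and 40% model-FLOP utilisation (MFU) are illustrative assumptions.

PEAK_BF16 = {        # dense BF16, FLOP/s per chip
    "H100": 989e12,
    "B200": 2.25e15,
}

def training_days(total_flops, chip, n_chips, mfu=0.40):
    """Days to burn `total_flops` at an assumed sustained utilisation."""
    sustained_flops_per_s = PEAK_BF16[chip] * mfu * n_chips
    return total_flops / sustained_flops_per_s / 86_400

budget = 1e25   # illustrative frontier-scale budget, in FLOPs
for chip in PEAK_BF16:
    print(f"{chip}: ~{training_days(budget, chip, n_chips=10_000):.0f} days on 10,000 chips")
```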
AMD:
- MI300X (2023): 192 GB HBM3, 5.3 TB/s, 1307 TFLOP/s FP16. The HBM capacity advantage briefly made it the best inference platform for 70B+ models (see the sketch after this list).
- MI325X (2024): 256 GB HBM3e, 6 TB/s.
- MI350 / MI355X (2025): targets B200 parity with 288 GB of HBM3e.
- ROCm is improving but still lags CUDA in kernel coverage, distributed-training maturity, and ecosystem support.
- Customers: Microsoft (large MI300X deployments for OpenAI inference), Meta, Oracle.
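The memory argument behind the MI300X's brief inference lead is simple capacity arithmetic, sketched below. The FP16 byte count and the ~10 GB KV-cache allowance are rough assumptions, not vendor figures.

```python
# Capacity arithmetic: can a 70B-parameter model's weights fit on one chip?
# Bytes-per-parameter and the KV-cache allowance are rough assumptions.

GB = 1e9
params = 70e9
fp16_weights_gb = params * 2 / GB        # ~140 GB of weights at FP16
kv_cache_gb = 10                         # rough single-batch allowance

for name, hbm_gb in {"H100": 80, "MI300X": 192}.items():
    needed = fp16_weights_gb + kv_cache_gb
    verdict = "fits on one chip" if needed <= hbm_gb else "needs multiple chips"
    print(f"{name} ({hbm_gb} GB HBM): ~{needed:.0f} GB required -> {verdict}")
```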
Google TPU: vertically integrated, available only through Google Cloud. Trillium (v6e, 2024) delivers 4.7× the per-chip compute of v5e. Used for all Google internal training (Gemini) and accessible to GCP customers. See TPU Systolic Array.
AWS Trainium / Inferentia:
- Trainium2 (2024): 1.3 PFLOP/s BF16, 96 GB HBM3, deployed in EC2 Trn2 UltraServers (16 chips, NeuronLink). Anthropic's Project Rainier uses ~400k Trainium2 chips (see the sketch after this list).
- Inferentia2: inference-optimised, used by Amazon's own Alexa, Search, and ad ranking.
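For a sense of the Rainier fleet's scale, a rough aggregate-throughput sketch using the per-chip peak quoted above; the 30 % sustained utilisation and the $10^{26}$-FLOP target are illustrative assumptions.

```python
# Aggregate throughput of the Rainier fleet, from the per-chip figure above.
# The 30% sustained utilisation and the 1e26 FLOP target are assumptions.

peak_per_chip = 1.3e15        # BF16 FLOP/s per Trainium2
chips = 400_000               # approximate fleet size quoted above
utilisation = 0.30            # assumed sustained fraction of peak

aggregate_peak = peak_per_chip * chips           # ~5.2e20 FLOP/s
days_to_1e26 = 1e26 / (aggregate_peak * utilisation) / 86_400
print(f"peak {aggregate_peak:.1e} FLOP/s; ~{days_to_1e26:.0f} days to 1e26 FLOPs at 30% util")
```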
Microsoft Maia 100 (2023): in-house 5 nm ASIC, deployed in Azure for OpenAI inference. Deployment scale is undisclosed but believed to exceed 100,000 units.
Meta MTIA v2 (2024): in-house inference ASIC for ranking and recommendation; some LLM inference. Not used for frontier training.
Apple Neural Engine: 16-core NPU on every M-series and A-series chip, ~38 TOPS INT8 on M4. On-device only, not a training accelerator, but the largest installed base of AI silicon by unit count.
Cerebras WSE-3 (2024): wafer-scale chip, 4 trillion transistors, 900,000 cores, 44 GB on-chip SRAM, 21 PB/s memory bandwidth. Optimised for sparse training and very-low-latency inference; CS-3 system runs Llama 3.1 70B at 450 tokens/s/user.
Groq LPU (2023): deterministic dataflow architecture, 230 MB on-chip SRAM, no HBM. Delivers >500 tokens/s on Llama 3 70B inference, by far the fastest commercial inference. Trade-off: each chip holds only a small slice of a model's weights, so models are sharded across hundreds of chips; capex per token of capacity is high.
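The sharding trade-off follows directly from the SRAM figure. A minimal sketch of how many LPUs are needed just to hold a 70B model's parameters; INT8 weights and zero allowance for activations or KV cache are simplifying assumptions.

```python
# How many LPUs does it take just to hold a 70B model's weights in SRAM?
# INT8 weights and zero allowance for activations/KV cache are simplifying assumptions.

sram_per_chip = 230e6    # bytes of on-chip SRAM per LPU, from the spec above
params = 70e9
bytes_per_param = 1      # INT8 quantisation (assumption)

chips_needed = params * bytes_per_param / sram_per_chip
print(f"~{chips_needed:.0f} chips just to hold the weights")   # ~304
```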
Tenstorrent: Jim Keller's RISC-V-based AI accelerator (Wormhole, Blackhole). The open architecture offers more flexibility, but it currently trails in software maturity and deployment scale.
Where each fits (rough rule of thumb):
- Frontier training ($10^{25}+$ FLOPs): Nvidia, TPU, and Trainium are the only platforms with proven multi-thousand-chip scale-out.
- Production inference of frontier models: Nvidia (default), AMD MI300X (memory-bound 70B+ models), Trainium2 (Anthropic), Maia (OpenAI on Azure).
- Low-latency inference: Groq, Cerebras, SambaNova.
- Edge / on-device: Apple Neural Engine, Qualcomm Hexagon, Google Tensor.
- Research & long tail: AMD on-prem, Tenstorrent, Graphcore (declining).
The CUDA software moat remains the structural reason Nvidia's share is so concentrated; every challenger's roadmap is at least as much about software (compiler, kernel library, distributed runtime, framework integration) as it is about silicon.
Related terms: Tensor Cores, TPU Systolic Array, NVLink and NVSwitch, Inference Cost Economics, Training-Cluster Economics
Discussed in:
- Chapter 15: Modern AI