- Derive the Kaplan and Chinchilla scaling laws and use them to choose a compute-optimal model size and training-set size
- Critically appraise claims of emergent abilities, distinguishing genuine phase transitions from metric artefacts
- Describe the modern pre-training recipe (data curation, tokenisation, training stack, curriculum) for a frontier language model
- Derive the RLHF objective from the Bradley–Terry preference model and explain the role of the KL penalty in PPO fine-tuning
- Derive Direct Preference Optimisation from the closed-form RLHF optimum and compare it with IPO, KTO, ORPO and SimPO
- Explain GRPO and how reasoning models such as DeepSeek-R1 are trained on verifiable rewards
- Use test-time compute (best-of-$N$, self-consistency, tree search, thinking tokens) to trade inference cost for accuracy
- Distinguish process reward models from outcome reward models and explain the finding of Lightman et al. (2023) that process supervision outperforms outcome supervision on mathematical reasoning
- Implement retrieval-augmented generation, tool use and agentic loops, and reason about their failure modes
- Describe the multimodal frontier (vision–language, audio, video, embodied) and the state of evaluation in 2026
The decade from 2015 to 2025 was, in retrospect, the decade in which artificial intelligence stopped being a discipline mostly concerned with research benchmarks and became a piece of infrastructure underpinning much of the world's writing, coding and customer support, and an increasing share of its scientific reasoning. The Transformer arrived in 2017, GPT-3 in 2020, ChatGPT in 2022, GPT-4 in 2023 and the first usable reasoning models in 2024; by 2026 the frontier looked very different from anything that had preceded it. This chapter is a snapshot of where that journey had reached as of April 2026, written from the vantage of someone who needs both to use these systems and to understand how they work.
The earlier chapters of this book covered the scaffolding: linear algebra, probability, optimisation, classical machine learning, neural networks, the Transformer. This chapter is concerned with what happens when you apply that scaffolding at the largest scale that humanity has ever pointed at a single model class. We start with the empirical scaling laws that governed the era. We move through the pre-training recipe, the alignment recipe and the reasoning recipe. We discuss test-time compute (the suddenly central idea that you can spend money at inference time rather than at training time). We cover tools, agents and retrieval. We end with a survey of the frontier as of early 2026 and an end-to-end recipe that any reader can run on a single GPU.
A note on style. Modern AI is a fast-moving field, and any chapter written about it is partly a hostage to fortune. We have tried to focus on the equations, the qualitative findings and the design principles: the things that we expect to outlive the specific model names. Where we name a system, we name it because the design choice it embodies is instructive, not because that particular system is the latest.
In this chapter
- 15.1 The scaling era
- 15.2 Emergent abilities and the mirage critique
- 15.3 The pre-training recipe
- 15.4 Supervised fine-tuning
- 15.5 RLHF: from preferences to policies
- 15.6 DPO and the reward-free family
- 15.7 GRPO and reasoning-model training
- 15.8 Test-time compute scaling
- 15.9 Process supervision
- 15.10 In-context learning and few-shot
- 15.11 Chain-of-thought
- 15.12 Constitutional AI
- 15.13 Tools, function calling and agents
- 15.14 Retrieval-augmented generation
- 15.15 Multimodal models
- 15.16 The frontier as of April 2026
- 15.17 Open versus closed weights
- 15.18 Evaluation in 2026
- 15.19 End-to-end recipe
- 15.20 Inference and serving
- 15.21 Safety, interpretability, and the open questions
- 15.22 Where we are
- Exercises
- Solutions to selected exercises