15.1 The scaling era
A frontier AI system in 2024–2026 is not a single artefact. It is the end of a long industrial pipeline, in which a transformer is first immersed in trillions of tokens of human writing, then taught to follow instructions, then preference-tuned to be helpful and honest, then increasingly fine-tuned to reason, use tools, and operate as an agent. The result is what people now mean when they say "AI": a model such as Claude, GPT, Gemini, DeepSeek-R1, Llama or Grok, which converses fluently, writes code, solves multi-step problems, and grounds itself in retrieved documents or live tools.
This chapter is the synthesis. The previous fourteen chapters built the components: linear algebra, calculus, probability, statistics, classical machine learning, the perceptron, the multi-layer perceptron and backpropagation, convolutions, recurrence, attention, the transformer block, generative modelling and diffusion. Chapter 15 is what we do with all of it. We assemble these ingredients into a frontier system, examine the scaling laws that underwrote the assembly, and survey the systems that currently sit at the top of the field as of early 2026. Chapter 16 then takes up the safety, interpretability and ethical questions that follow once such systems are deployed at population scale.
Before we descend into derivations, it helps to step back and ask why this chapter is the longest in the book, and why it cannot be skipped. The transformer in §13 was a piece of mathematics. The generative models in §14 were probability distributions over data. Neither, on its own, is a useful product. To turn a transformer into something a clinician, a programmer or a student can actually rely on, we must spend ten million GPU-hours on raw text, then a much smaller but cleverer budget on supervised demonstrations, then a smaller budget still on preference learning, and finally we must wrap the whole apparatus in a serving stack that can answer in under a second. Each stage is a chapter on its own. §15.1 is the map of the territory.
The single most consequential discovery in deep learning between 2017 and 2022 was not an architecture. It was a graph. When you plot the cross-entropy loss of a transformer language model against compute, parameters or data on a log–log axis, you obtain a straight line. That line continues across many orders of magnitude. It does not bend. It does not saturate. It just keeps going. That empirical regularity, the scaling laws, underwrote the entire LLM era. If you accept the line, then bigger really is better, and the only interesting questions become how to spend your compute optimally and what to do with the model once you have it.
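To see what fitting that line involves, here is a minimal sketch with synthetic numbers standing in for measured losses. It assumes the usual functional form $L(C) = L_\infty + k\,C^{-\alpha}$ with an irreducible entropy floor $L_\infty$; subtracting the floor makes the power law a straight line on log–log axes, so the fit reduces to ordinary least squares. The constants are illustrative, not measurements from any real run.

```python
import numpy as np

# Synthetic stand-in for measured training losses: L(C) = L_inf + k * C^(-alpha).
# The constants are illustrative, not values from any real run.
L_inf, k, alpha = 1.7, 400.0, 0.13
compute = np.logspace(18, 24, 7)          # FLOPs spanning six orders of magnitude
loss = L_inf + k * compute ** (-alpha)

# Subtracting the floor linearises the power law in log-log space:
# log(L - L_inf) = log(k) - alpha * log(C), an ordinary least-squares fit.
slope, intercept = np.polyfit(np.log(compute), np.log(loss - L_inf), 1)
print(f"fitted alpha = {-slope:.2f}")             # recovers 0.13
print(f"fitted k     = {np.exp(intercept):.0f}")  # recovers 400
```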
The frontier model recipe
A modern frontier model is a transformer pushed through four sequential training stages. The recipe is the same across laboratories; what varies is the data, the budget and the post-training detail. Strip away the marketing and every model on the leaderboard fits this template.
- Pretrain on ten to fifteen trillion tokens with next-token prediction. The objective is plain cross-entropy. Compute lands somewhere between $10^{24}$ and $10^{26}$ FLOPs depending on the laboratory. The corpus is a filtered, deduplicated mix of web crawl (predominantly CommonCrawl), books, scientific literature, source code (largely from GitHub), conversational data and, increasingly, multimodal pairs. This stage produces a base model: fluent, knowledgeable, but unbiddable. It will continue your prompt rather than answer it.
- Supervised fine-tuning (SFT) on instruction–response demonstrations. Hundreds of thousands of high-quality examples, often hand-written by domain experts and contractors, teach the base model to follow instructions, refuse certain requests, format answers in markdown, and respect a system prompt. The compute here is tiny, perhaps a thousandth of pretraining, yet the behavioural change is enormous.
- Preference alignment via RLHF, DPO or a close cousin. The model is shown pairs of its own outputs ranked by humans (or, increasingly, by a stronger model acting as judge) and is nudged to prefer the winners. This is the stage at which the polite, careful, helpful character emerges.
- Optional capability post-training: reasoning RL with verifiable rewards (the route that produced o1, DeepSeek-R1 and Claude's extended thinking), tool-use fine-tuning, retrieval grounding, and agentic scaffolding for long-horizon tasks.
Each stage uses orders of magnitude less compute than the one before it but contributes outsized capability. The dataset for stage one is web-scale; for stage two, expert-curated; for stage three, preference-graded; for stage four, verifier-graded. This decreasing-data, increasing-quality curve is the central craft of a modern training pipeline, and we shall walk through each stage in detail across the rest of the chapter.
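To make the stage-one objective concrete before the detailed treatment in §15.3, here is a minimal sketch of next-token cross-entropy in PyTorch. The toy dimensions and the embedding-plus-linear stand-in for the transformer are placeholders of ours; a real run shards a full transformer across thousands of accelerators, but the loss is exactly this.

```python
import torch
import torch.nn.functional as F

# Toy causal-LM step: the model maps token ids to logits of shape
# (batch, seq_len, vocab_size). A stand-in replaces the transformer here.
vocab_size, seq_len, batch = 50_000, 128, 4
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 256),
    torch.nn.Linear(256, vocab_size),
)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # a batch of corpus text
logits = model(tokens)

# Shift by one position: position t predicts token t+1. This objective,
# repeated over ~10^13 tokens, is the whole of pretraining.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets: the tokens that follow
)
loss.backward()  # gradients for one optimiser step
```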
It is worth emphasising how recently this recipe stabilised. As late as 2021, the dominant view was that bigger pretraining alone would carry the field; alignment was a research curiosity. The release of InstructGPT in early 2022, and then of ChatGPT in November 2022, demonstrated that supervised and preference-based post-training were not optional polishing; they were the difference between a research artefact and a product used by hundreds of millions of people. Within eighteen months every serious laboratory had converged on the four-step pipeline. By 2024 the open-weights community had reproduced it in full; by 2025 the verifiable-reward reasoning stage had been added on top. The recipe is now stable enough that we can describe it linearly, and the rest of this chapter does exactly that.
Scaling laws (preview)
Kaplan et al. 2020 and Hoffmann et al. 2022 established that the pretraining loss falls as a power law in compute, parameters and data. The full derivations and disagreements appear in §15.1's later subsections and again in §15.2; for the synthesis here we need only the headline results.
Kaplan's recommendation, given a compute budget $C$, was to make the model very large and undertrain it: roughly $N \propto C^{0.73}$ and $T \propto C^{0.27}$. This produced GPT-3, 175 billion parameters trained on 300 billion tokens, and the dozens of similarly-shaped models that followed. Two years later Hoffmann et al. revisited the experiment with better learning-rate schedules and a wider sweep of model sizes, and arrived at a sharply different conclusion. For compute-optimal training,
$$ N_{\text{opt}} \propto C^{a}, \qquad T_{\text{opt}} \propto C^{b}, \qquad a \approx b \approx 0.5, $$
with the famous practical heuristic $T_{\text{opt}} \approx 20 \cdot N_{\text{opt}}$: about twenty tokens per parameter. Chinchilla (70 B parameters, 1.4 T tokens) beat Gopher (280 B parameters, 300 B tokens) at the same compute budget, and the recipe became canon.
For a modern GPT-4-class run at roughly $10^{25}$ FLOPs, the heuristic (via the standard approximation $C \approx 6NT$) places $N$ near $3 \times 10^{11}$ parameters and $T$ near $6 \times 10^{12}$ tokens, although in practice every laboratory now over-trains relative to Chinchilla because inference is paid in $N$, not in $T$. Llama 3 8B was trained on 15 trillion tokens, nearly 1,900 tokens per parameter and almost a hundred times the Chinchilla ratio, and is spectacular value at serving time despite being deeply compute-suboptimal in the Chinchilla sense. The deployed-model arithmetic is dominated by inference, and inference cost scales with parameters alone.
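That arithmetic is easy to reproduce. A minimal sketch in Python, assuming the standard dense-transformer approximation $C \approx 6NT$ and the twenty-tokens-per-parameter heuristic:

```python
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Compute-optimal (N, T) under C ~= 6*N*T with T = r*N.

    Solving C = 6 * N * (r * N) gives N = sqrt(C / (6*r)) and T = r * N.
    """
    r = tokens_per_param
    n_params = (compute_flops / (6.0 * r)) ** 0.5
    n_tokens = r * n_params
    return n_params, n_tokens

# A GPT-4-class budget of 1e25 FLOPs:
N, T = chinchilla_allocation(1e25)
print(f"N ~ {N:.2e} parameters, T ~ {T:.2e} tokens")
# -> N ~ 2.89e+11 parameters, T ~ 5.77e+12 tokens
```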
The point of the scaling laws is not the precise exponents, which still drift as methodology improves. The point is the line: that doubling compute reliably halves the gap to the irreducible entropy floor, that the gains arrive smoothly rather than in fits, and that the practitioner can therefore plan a training run before launching it. That predictability is what made $10^{8}$-dollar pretraining runs financeable. Without it, the modern frontier would not exist.
A second consequence of the scaling laws, less often discussed but no less important, is that they impose a discipline on research. If you propose a new architectural tweak, the right question to ask is not "does it help at the small scale where I can afford to test it?" but "does it shift the line?" Hundreds of clever modifications produce a small gain at $10^{8}$ parameters and vanish at $10^{10}$. The scaling-laws methodology (sweep model size, fit the curve, extrapolate) has become the standard way to separate genuine improvements from noise.
Why post-training matters
A pretrained base model is an extraordinary engine but a hopeless assistant. Ask GPT-3 in 2020 "what is the capital of France?" and it might continue with "is a question often asked by tourists", perfectly plausible web text, perfectly useless. Pretraining maximises the likelihood of the next token in the corpus, and the corpus is full of unanswered questions, half-finished prose and rhetorical flourishes. The base model has no concept of being helpful because no token in its objective ever rewards it for being so.
Post-training is the discipline of bending this engine towards human use. SFT, RLHF and DPO together do three things at once. They teach the model to answer rather than to continue. They teach it to refuse certain classes of request, a non-trivial capability, since refusal in the right place is what makes a model deployable. And they teach it to converse: to accept a system prompt, to maintain persona, to format mathematics in LaTeX and code in fenced blocks, to ask clarifying questions when the user's intent is ambiguous.
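One mechanism behind "answer rather than continue" is worth previewing before §15.4. SFT commonly reuses the pretraining cross-entropy but masks the prompt tokens out of the loss, so the gradient flows only through the response. A minimal sketch, with placeholder shapes and random logits standing in for the model:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # F.cross_entropy skips targets with this index

def sft_targets(token_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Mask the prompt so the loss covers only the response tokens."""
    targets = token_ids.clone()
    targets[:prompt_len] = IGNORE    # no gradient for predicting the prompt
    return targets

# tokens = [prompt tokens ............ | response tokens ...]
tokens = torch.randint(0, 50_000, (64,))   # one concatenated training example
targets = sft_targets(tokens, prompt_len=20)

logits = torch.randn(64, 50_000)           # stand-in for the model's outputs
loss = F.cross_entropy(
    logits[:-1],                           # position t predicts token t+1
    targets[1:],                           # shifted, prompt-masked targets
    ignore_index=IGNORE,
)
```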
The compute spent here is small. A fortnight of cluster time for SFT, perhaps another fortnight for preference learning, against the months that pretraining absorbs. Yet the difference in usefulness is qualitative. ChatGPT was, technically, GPT-3.5 with three weeks of post-training stitched on. The world reacted to the post-training, not to GPT-3.5.
Post-training also explains why two models trained on similar data and at similar scale can feel utterly different to use. The character of an assistant (its willingness to push back, its tone when it does not know, the granularity of its formatting) is set largely in stages two and three. Anthropic's Claude and OpenAI's GPT models read differently not because the underlying transformers are radically different but because their preference data, refusal taxonomies and constitutions are. Post-training is, in this sense, the locus of brand.
What this chapter covers
The chapter unfolds in roughly the order of the pipeline. §15.2 takes up emergent abilities and the mirage critique: the question of whether new capabilities really do appear discontinuously with scale. §15.3 walks through the pre-training recipe in detail: data, tokenisers, mixtures, deduplication, curriculum. §15.4 covers supervised fine-tuning. §15.5 derives RLHF from the policy-gradient theorem. §15.6 introduces DPO and the reward-free family. §15.7 turns to GRPO and the reasoning-model training that produced o1 and DeepSeek-R1. §15.8 examines test-time compute scaling, the trade between training the model harder and letting it think longer. §15.9 covers process supervision; §15.10–§15.12 discuss in-context learning, chain-of-thought and Constitutional AI. §15.13–§15.15 then move outward to tools, function calling, agents, retrieval-augmented generation and multimodality. §15.16 surveys the actual frontier as of early 2026. §15.17 contrasts open and closed weights. §15.18 turns to evaluation and §15.19 stitches everything together with an end-to-end recipe. §15.20–§15.22 close on inference, safety and a forward look.
The reader should expect this chapter to be denser than the others, because it is doing two jobs at once. It is teaching specific algorithms (RLHF, DPO, GRPO, MoE routing, retrieval, agentic loops); it is also acting as the integration point for the entire book. When we discuss the loss function in §15.5, we shall lean on the policy-gradient material from §11. When we discuss long-context attention later in the chapter, we shall lean on the transformer mechanics from §13. When we discuss multimodal training in §15.15, we shall lean on the diffusion derivations from §14. The chapter does not re-derive these tools; it deploys them. If you found yourself skimming earlier chapters, this is the section in which their absence will be felt most sharply.
Where the frontier sits in early 2026
The roughly $10^{25}$-FLOP class, what the field calls "frontier", is now occupied by perhaps eight laboratories. As of April 2026 the picture looks roughly as follows.
OpenAI continues to ship the GPT and o-series in parallel: GPT-5 as the flagship generalist, o3 and its successors as the reasoning specialists. Anthropic ships the Claude 4 family, with the Sonnet 4 and Opus 4 tiers covering the breadth-versus-depth axis and an extended-thinking mode that bridges to the reasoning specialists. Google DeepMind's Gemini 2 series is the multimodal leader, with strong long-context performance and tight integration into the Workspace product surface. xAI's Grok 4 has closed much of the gap in 2025–2026 and competes seriously on reasoning benchmarks.
The open-weights frontier is led by DeepSeek (V3 dense and R1 reasoning) and Meta's Llama 4 family, with Qwen and Mistral close behind. DeepSeek-R1 (January 2025) matched or exceeded closed reasoning models in mathematics and coding at perhaps a tenth of the cost, and the release has reshaped the field.
These models differ along axes that matter for deployment: instruction-following on long, structured prompts; mathematical and programming ability; multilingual coverage; tool-use reliability; safety properties under adversarial probing; latency and cost per million tokens; context window length. There is no single best model. Practitioners pick by use case, often routing requests across a small ensemble: a cheap fast model for the easy 95 % and a reasoning model for the hard 5 %.
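As a caricature of that routing pattern, here is a sketch with hypothetical model names, placeholder prices and a keyword heuristic standing in for what would, in production, be a small trained difficulty classifier:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_mtok: float  # placeholder price, pounds per million output tokens

# Hypothetical tiers; names and prices are illustrative, not real price lists.
FAST = Model("fast-generalist", 0.5)
REASONER = Model("reasoning-specialist", 8.0)

def route(prompt: str) -> Model:
    """Cheap model for the easy 95%, reasoning model for the hard 5%."""
    hard_signals = ("prove", "step by step", "debug", "optimise", "derive")
    if len(prompt) > 2000 or any(s in prompt.lower() for s in hard_signals):
        return REASONER
    return FAST

print(route("What is the capital of France?").name)     # fast-generalist
print(route("Prove that sqrt(2) is irrational.").name)  # reasoning-specialist
```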
The economics, too, have shifted dramatically. A million output tokens from a frontier model in 2023 cost roughly £40; in early 2026 the same call costs perhaps £2 to £5 depending on provider, with reasoning models priced at a premium that reflects the latent thinking tokens. Inference now dominates the field's compute budget: globally, more FLOPs are spent serving these models each week than were spent training any single one of them, and the engineering attention has followed.
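The reasoning premium is mostly billing arithmetic: the hidden chain-of-thought tokens are charged as output even though the user never sees them. A sketch with placeholder prices:

```python
def call_cost(output_tokens: int, thinking_tokens: int,
              price_per_mtok: float) -> float:
    """Reasoning models bill hidden 'thinking' tokens as output; that, plus a
    higher per-token price, is the premium. Prices here are placeholders."""
    return (output_tokens + thinking_tokens) * price_per_mtok / 1e6

standard = call_cost(1_000, 0, price_per_mtok=3.0)         # ~£0.003 per call
reasoning = call_cost(1_000, 20_000, price_per_mtok=10.0)  # ~£0.21 per call
print(f"standard £{standard:.4f}, reasoning £{reasoning:.2f}")
```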
Open vs closed model frontier
Open weights (Llama, Mistral, Qwen, DeepSeek) typically lag closed weights by six to twelve months on the headline benchmarks, though the gap fluctuates and occasionally inverts. DeepSeek-R1 was at or above the closed reasoning frontier on its release and stayed there for most of the first quarter of 2025. The asymmetry is interesting: open weights are often behind on raw capability but ahead on transparency, customisation, fine-tuning freedom and offline deployability. For a hospital that cannot send patient records to a third-party API, an 8B Llama running on a local box is not a worse option than GPT-5; it is the only option.
The "open frontier" democratises capability, and that is mostly a good thing. It also raises legitimate dual-use concerns, since a frontier model is the kind of artefact that does not stop being useful when downloaded by a malicious actor. The field's current consensus is that release should be calibrated to the marginal capability uplift over what is already public, and that the bar should rise as the absolute capability rises. Anthropic, OpenAI and DeepMind do not release weights. Meta and DeepSeek do. This is the central political fact of the field in 2026, and we return to it in §15.17 and §16.
A practical note for readers building on these systems. Open weights are the right starting point for any application that touches sensitive data, requires offline operation, or needs to be fine-tuned on domain-specific corpora. Closed APIs are the right starting point for general-purpose products where the latency, capability and cost-per-token of the closed frontier outweigh the loss of control. Most serious deployments now mix both: a closed model for the user-facing surface, an open model fine-tuned on internal data for the back-office pipeline. The architectural question of when to reach for which is itself a skill, and one that the sections that follow will help to develop.
What you should take away
- A frontier model is not just a transformer. It is a transformer plus pretraining at $\sim 10^{25}$ FLOPs plus supervised fine-tuning plus preference alignment plus, increasingly, reasoning RL and agentic scaffolding. All four stages matter, and skipping any of them produces something nobody would pay for.
- Scaling laws are the empirical regularity that makes the whole industry possible. Loss falls as a power law in compute. Chinchilla (about twenty tokens per parameter) is the compute-optimal recipe; in practice, deployed models over-train because inference is paid in parameters, not data.
- Post-training is where the personality lives. SFT, RLHF and DPO turn a fluent base model into a helpful, honest, refusing-where-it-should assistant, at perhaps a thousandth of the compute of pretraining.
- The 2026 frontier is roughly eight laboratories deep, split between closed (OpenAI, Anthropic, DeepMind, xAI) and open (Meta, DeepSeek, Qwen, Mistral). Pick by use case, latency and deployability, not by leaderboard rank alone.
- Everything beyond this section (RLHF, DPO, GRPO, MoE, retrieval, agents, multimodality, evaluation, serving, safety) is a specialisation of the four-step recipe above. Hold the recipe in mind as you read the rest of the chapter; each subsequent section is a magnifying glass over one of its stages.