15.8 Test-time compute scaling

[Video, 1:05: accuracy as a function of inference budget for three model strengths.]
For most of the deep-learning era, "scaling" meant exactly three things: more parameters, more training tokens, and more pre-training compute. The Kaplan and Chinchilla scaling laws (chapter 11) compressed this into power-law curves, and laboratories spent the years from 2020 to 2024 walking up those curves with ever-larger clusters. By 2024, however, frontier capability gains from raw pre-training were slowing. Marginal returns on the next order of magnitude looked uncertain, training runs cost hundreds of millions of dollars, and the highest-quality web text was running out. A new axis was needed.

That axis turned out to be inference. OpenAI's o3, announced in December 2024, demonstrated that the same underlying model could be made dramatically more capable simply by allowing it to spend more compute at the moment of answering. On the ARC-AGI benchmark, a battery of abstract visual puzzles long held up as a stress test for fluid intelligence, o3 jumped from roughly thirty per cent (the previous frontier) to 87.5 per cent on the public evaluation, surpassing the human-baseline threshold for the first time. The model's weights were not larger than its predecessor's in any meaningful sense; what was larger was the budget of reasoning tokens it was permitted to expend per question. On the harder puzzles, o3 was reported to consume tens of millions of tokens per problem, at a per-task inference cost in the thousands of dollars.

This is the new lever. Capability is no longer a function of model size and training data alone; it is also a function of how much compute you are willing to spend per query. Section 15.7 described the reinforcement-learning recipes (PPO, GRPO, RL on verifiable rewards) that teach a model to make productive use of long reasoning traces; this section describes the family of techniques those models use at run time. The two are inseparable: training tells the model how to think, inference scaling lets it think for longer.

Symbols Used Here

$T_{\text{thinking}}$: tokens of reasoning before the answer
$N$: number of independent samples drawn at inference
$V(x, y)$: verifier or reward-model score for response $y$ on input $x$
$C_{\text{infer}}$: compute spent per query at inference time

Best-of-$N$ sampling

The simplest, most direct way to spend more compute per query is to draw more than one answer and pick the best. Best-of-$N$ takes a fixed prompt, samples $N$ independent completions from the policy at non-zero temperature, scores each one with a verifier $V$, and returns the highest-scoring response. Mathematically the procedure is

$$\hat{y} = \arg\max_{y \in \{y_1, \ldots, y_N\}} V(x, y), \qquad y_i \sim \pi_\theta(\cdot \mid x).$$

The verifier $V$ can be many things. For mathematics or coding, it might be a unit-test runner or a Lean proof checker; these give a hard 0/1 signal and, when available, are decisive. For open-ended tasks, $V$ is typically a learned reward model trained on preference data (sections 15.5–15.6). The quality of best-of-$N$ is bounded above by the quality of $V$: with an imperfect verifier, accuracy eventually plateaus, because beyond some $N$ the additional samples are dominated by responses that look good to the verifier but are not actually correct, a form of reward hacking by selection rather than by training.

Empirically, with a strong verifier, accuracy improves roughly logarithmically in $N$: doubling $N$ adds a roughly constant accuracy increment, until the plateau is reached. Snell et al. 2024 showed that for many problem distributions, spending $K\times$ more compute on best-of-$N$ at inference matches the accuracy gain of training a model $K\times$ larger. A Llama-3 8B equipped with verifier-guided search at inference time can match a Llama-3 70B doing single-shot inference on a fixed total-FLOP budget. This shifts the economics of frontier AI substantially: training is a fixed sunk cost, but inference compute is paid per query, so a strong inference recipe extends a model's useful capability without retraining.

Best-of-$N$ has two operational virtues that recommend it as a baseline. First, it is embarrassingly parallel: the $N$ samples have no dependencies on one another, so they can be batched across GPUs for trivial throughput. Second, it composes cleanly with everything else: you can run best-of-$N$ over chain-of-thought outputs, over tree-search trajectories, or over agent rollouts. The drawback is that cost scales linearly with $N$ for a fixed verifier, so for very large $N$ the per-query expense becomes prohibitive without smarter strategies.
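As a concrete baseline, here is a minimal best-of-$N$ sketch in Python. The `sample` and `verifier` callables are hypothetical stand-ins for a policy call at non-zero temperature and for $V(x, y)$; in production the $N$ samples would be batched in parallel rather than drawn in a loop.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    sample: Callable[[str], str],           # hypothetical: one completion at temperature > 0
    verifier: Callable[[str, str], float],  # hypothetical: V(x, y), higher is better
    n: int = 16,
) -> str:
    """Draw n independent completions and return the verifier's favourite."""
    # The samples are independent, so this loop is embarrassingly parallel.
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: verifier(prompt, y))
```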

Self-consistency

Self-consistency, introduced by Wang et al. 2022, is best-of-$N$'s clever cousin for tasks with a small set of canonical answers: multiple-choice questions, numerical answers, formal expressions. Instead of asking a verifier which response is best, it aggregates over samples: draw $N$ chains of thought, extract the final answer from each, and return the majority vote.

The intuition is that there are many wrong reasoning paths but only one (or a small number of) correct ones. If the model has a non-trivial chance of reaching the correct answer on any individual attempt, then the most common answer across many attempts will tend to be the correct one, even when individual chains contain errors. No verifier or reward model is needed; the aggregation is purely combinatorial. Wang and colleagues showed that self-consistency at $N = 40$ improves chain-of-thought arithmetic accuracy on benchmarks such as GSM8K and MATH by ten to twenty absolute percentage points over greedy decoding, a substantial gain for a method that is essentially a for loop and a Counter object.
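Since the method really is just a for loop and a `Counter`, a minimal sketch makes the point. `sample_cot` and `extract_answer` are hypothetical stand-ins for a chain-of-thought sample and an answer-normalisation step (parsing out a number, a letter, or an expression).

```python
from collections import Counter
from typing import Callable

def self_consistency(
    prompt: str,
    sample_cot: Callable[[str], str],      # hypothetical: one chain-of-thought completion
    extract_answer: Callable[[str], str],  # hypothetical: normalise the final answer
    n: int = 40,
) -> str:
    """Majority vote over the final answers of n independent chains of thought."""
    answers = [extract_answer(sample_cot(prompt)) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```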

Self-consistency is not a panacea. It assumes the answer space is discrete enough that "majority vote" is well defined, which limits it to tasks where the final answer can be normalised (a number, a multiple-choice letter, a parsed expression). For free-form generation it does not directly apply, although variants such as universal self-consistency use the model itself to identify the modal response among free-text answers. It also helps least on the easiest problems (where one sample suffices) and on the hardest (where every sample is wrong); the gains concentrate in the middle band of difficulty where the model is right somewhat more often than chance.

Chain-of-thought scaling

The third lever is to make each individual sample longer. Chain-of-thought, introduced as a prompting trick (section 15.11), becomes an architectural choice when the model is trained to think for many thousands of tokens before answering. OpenAI's o1, released in September 2024, was the first widely deployed example: rather than running search externally, the model performs the search inside its own context window using a long, hidden reasoning trace. The user sees only the final answer; the model has spent perhaps thirty thousand tokens deliberating internally.

The mathematics is unchanged from any other autoregressive Transformer: every thinking token costs the same forward pass as every other token. What changed was the training recipe (covered in section 15.7): RL on verifiable rewards taught the model to use thinking tokens productively, allocating them adaptively to problem difficulty. Easy questions receive a few hundred tokens of deliberation; hard combinatorial problems receive tens of thousands. The model has, in effect, learned its own answer to the question of how much compute a problem deserves.

DeepSeek-R1 (DeepSeek-AI, 2025), released in January 2025, was the first credible open-weights reproduction of this paradigm, using GRPO and showing that the long-CoT capability could be transferred to smaller models via distillation. By April 2026 the thinking-token paradigm is standard: Claude has extended thinking, Gemini has Deep Think, and most open-source families ship a <think>...</think> mode. Many are hybrids: they choose, per query, whether to think and for how long, gated either by a user toggle or by the model's own confidence estimate.

The crucial empirical finding is that the relationship between thinking-token budget $T_{\text{thinking}}$ and accuracy follows another power-law-like curve. Doubling the thinking budget yields a roughly constant accuracy increment, up to a saturation point. This is structurally identical to the pre-training scaling laws but with a different independent variable, and it provides the reason that o3's performance on ARC-AGI was so consequential: the same model, given access to a vastly larger thinking budget, climbed dozens of percentage points on a benchmark long thought to require fundamental architectural advances.
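To make the shape of that curve concrete, here is an illustrative toy model: accuracy rises by a roughly constant increment per doubling of $T_{\text{thinking}}$ (linearly in $\log_2 T$) until it hits a ceiling. The constants are invented for illustration, not fitted to any real model.

```python
import math

def accuracy_vs_budget(t_thinking: int, base: float = 0.30,
                       gain_per_doubling: float = 0.05,
                       ceiling: float = 0.90) -> float:
    """Toy scaling curve: a constant accuracy increment per doubling of the
    thinking budget, saturating at the ceiling. All constants are made up."""
    return min(base + gain_per_doubling * math.log2(t_thinking), ceiling)

for t in [256, 512, 1024, 2048, 4096, 8192]:
    print(t, round(accuracy_vs_budget(t), 2))  # 0.70, 0.75, 0.80, 0.85, 0.90, 0.90
```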

Tree search and process verification

For longer-horizon problems, simple sampling becomes wasteful: many candidates share an identical prefix and only diverge late, so independent samples redundantly recompute the same early reasoning. Tree-search methods exploit this structure. Each node represents a partial reasoning state; expansion proposes successor states, evaluation scores each candidate (typically with a process reward model, see section 15.9), and selection chooses which node to expand next using upper-confidence-bound heuristics borrowed from MCTS.
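A minimal sketch of that loop, using plain best-first search in place of full MCTS (a UCB-style exploration bonus could replace the raw score). `expand`, `prm_score`, and `is_complete` are hypothetical stand-ins for step proposal, a process reward model, and a termination check.

```python
import heapq
from typing import Callable, List, Optional

def best_first_search(
    prompt: str,
    expand: Callable[[str], List[str]],    # hypothetical: propose successor partial traces
    prm_score: Callable[[str], float],     # hypothetical: process reward model score
    is_complete: Callable[[str], bool],    # hypothetical: does the trace end in an answer?
    budget: int = 100,
) -> Optional[str]:
    """Best-first search over partial reasoning traces, guided by a process
    reward model. heapq is a min-heap, so scores are negated: higher PRM
    score means the node is expanded sooner."""
    frontier = [(-prm_score(prompt), prompt)]
    for _ in range(budget):
        if not frontier:
            break
        _, node = heapq.heappop(frontier)
        if is_complete(node):
            return node
        for child in expand(node):         # e.g. sample k candidate next steps
            heapq.heappush(frontier, (-prm_score(child), child))
    return None  # budget exhausted without a complete answer
```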

Tree-of-Thoughts (Yao et al., 2023) formalised this; subsequent work has applied MCTS over reasoning steps for mathematics (where partial solutions can be sanity-checked against axioms) and for code (where partial programs can be type-checked or test-run). Process reward models are central: rather than scoring only the final answer, they score each intermediate step, allowing the search to prune unpromising branches before they consume budget. AlphaProof and AlphaGeometry 2 (DeepMind, 2024) used tree search over Lean tactics to reach silver-medal IMO performance, and the technique generalises wherever a reasonable per-step reward is available.

In 2025–2026 tree search has been partly displaced, and partly subsumed, by iterated self-refinement: the model produces a candidate, critiques it, revises, and iterates serially until a budget is exhausted or the critique is satisfied. Self-Refine (Madaan et al., 2023), Reflexion (Shinn et al., 2023), and constitutional self-critique (the same machinery as section 15.12) all spend test-time compute serially rather than in parallel, and they tend to compose well with long-CoT thinking-token models.
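A sketch of the serial refinement loop follows. `generate`, `critique`, and `revise` are hypothetical stand-ins for three prompts to the same model, and the string acceptance check is a simplification of whatever stopping criterion the critic actually emits.

```python
from typing import Callable

def self_refine(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: initial draft
    critique: Callable[[str, str], str],     # hypothetical: model critiques its own draft
    revise: Callable[[str, str, str], str],  # hypothetical: revision given the critique
    max_rounds: int = 4,
    accept_token: str = "ACCEPT",
) -> str:
    """Serial test-time compute: draft, critique, revise, repeat until the
    critique signals acceptance or the round budget is exhausted."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, draft)
        if accept_token in feedback:         # critic is satisfied; stop early
            break
        draft = revise(prompt, draft, feedback)
    return draft
```

Unlike best-of-$N$, every round here depends on the previous one, so the compute is spent serially; the trade-off is that each revision can build on accumulated critique rather than starting from scratch.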

The new scaling law

We can now write a fuller expression for frontier-model capability:

$$\text{Capability} = f(\text{model size}, \text{training data}, \text{training compute}, \text{inference compute}).$$

Until 2024 the fourth term was approximately zero: production inference budgets were dominated by latency constraints and by the desire to serve many users cheaply, not by per-query capability. Today the fourth term is significant and, on the hardest tasks (research mathematics, advanced coding, ARC-AGI), it dominates. The pre-training scaling laws have not been repealed; they have been joined by a parallel inference-scaling law with its own slope and asymptote, set by verifier strength, sample diversity, and the model's trained ability to use long reasoning productively.

The economic consequences are still being worked out. A model that costs ten thousand dollars per ARC-AGI task is not commercially viable for most workflows, but it is viable for a research mathematician verifying a single proof, a clinician assembling a complex differential, or an engineer hunting a rare bug. We should expect the next few years to produce a Pareto frontier of inference budgets: fast, cheap responses for routine queries; deep, expensive deliberation for problems where it pays. The user, or an upstream router, decides where on the frontier each query sits.

What you should take away

  1. Scaling now has four axes, not three: model size, training data, training compute, and inference compute. The fourth was previously near zero; it is now where most frontier capability gains are coming from.
  2. Best-of-$N$ is the simplest inference-scaling recipe: sample $N$ completions and pick the best by a verifier. Accuracy scales roughly logarithmically in $N$ until imperfect verification causes a plateau.
  3. Self-consistency (Wang et al., 2022) is best-of-$N$ for tasks with canonical answers, using majority vote instead of a verifier. It improves multi-step arithmetic by 10–20 percentage points at $N=40$.
  4. Thinking-token models (o1, o3, DeepSeek-R1) train the model to do long internal chains of thought; o3 reached 87.5 per cent on ARC-AGI by spending tens of millions of tokens per task.
  5. Inference scaling does not replace training; it is enabled by training (RL on verifiable rewards from section 15.7) and constrained by verifier quality. Expect a Pareto frontier of inference budgets, with cheap fast responses and expensive deep deliberation chosen per query.
