Test-time compute scaling laws are the empirical relationships, formalised by Snell, Lee, Xu, and Kumar (UC Berkeley and Google DeepMind, 2024) in Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters, that describe how a fixed model's accuracy improves as additional compute is allocated to each inference query. They are the inference-side analogue of the chinchilla-scaling-law for training, and they reframed the post-2024 frontier as a contest over reasoning compute rather than parameter count.
The basic empirical claim is that on many reasoning benchmarks (MATH, GSM8K, ARC), a $\beta$-billion parameter model can match a $14\beta$-billion parameter model on easy and intermediate problems if it is allowed to spend a comparable additional compute budget at inference; on the very hardest problems, extra pre-training compute remains more effective. Concretely, Snell et al. fit log-linear relationships of the form
$$\mathrm{accuracy}(C_\mathrm{test}) = a + b \log C_\mathrm{test},$$
where $C_\mathrm{test}$ is the FLOPs spent generating, sampling, verifying, or searching at test time. The slope $b$ depends on the inference strategy and on problem difficulty: for easy problems, best-of-N with a verifier saturates after a few samples; for hard problems, sequential revision (the model edits its own draft) and tree search guided by a process reward model continue to extract gains for $N$ up to a few hundred.
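To make the fit concrete, here is a minimal sketch of estimating $a$ and $b$ by least squares from accuracy measured at a grid of compute budgets; the numbers are illustrative, not data from Snell et al.

```python
import numpy as np

# Hypothetical benchmark measurements: accuracy at increasing
# test-time compute budgets (FLOPs per query). Illustrative numbers,
# not from the paper.
c_test = np.array([1e12, 4e12, 1.6e13, 6.4e13, 2.56e14])  # FLOPs/query
accuracy = np.array([0.42, 0.51, 0.60, 0.68, 0.75])

# Fit accuracy = a + b * log(C_test) by ordinary least squares.
b, a = np.polyfit(np.log(c_test), accuracy, deg=1)
print(f"a = {a:.3f}, b = {b:.4f}")

# The slope b summarises how efficiently the inference strategy turns
# extra FLOPs into accuracy; comparing slopes across strategies
# (best-of-N, revision, tree search) on the same budget grid is the
# basic experiment behind these laws.
predicted = a + b * np.log(1e15)
print(f"extrapolated accuracy at 1e15 FLOPs/query: {predicted:.3f}")
```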
The headline trade-off is the compute equivalence frontier. For a fixed compute budget split between pre-training FLOPs $C_\mathrm{train}$ and per-query inference FLOPs $C_\mathrm{test}$, there is an optimal split that depends on (i) how often the model will be queried and (ii) the difficulty distribution of those queries. Models that will be deployed at high QPS (queries per second) should bank more compute in training; specialised reasoning systems answering a handful of hard questions per session should bank it in inference. OpenAI's openai-o3 and DeepSeek's deepseek-r1-zero sit far towards the inference-heavy end of this frontier.
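A back-of-the-envelope sketch of this frontier, using the standard approximations $C_\mathrm{train} \approx 6ND$ for training and roughly $2N$ FLOPs per generated token at inference; all model sizes, token counts, and sample counts below are assumptions chosen for illustration, not figures from Snell et al.

```python
# Lifetime compute = training FLOPs + (queries * per-query inference FLOPs).
def total_flops(n_params, train_tokens, queries, tokens_per_query, samples):
    train = 6 * n_params * train_tokens              # C_train ~ 6*N*D
    infer = queries * samples * 2 * n_params * tokens_per_query
    return train + infer

Q = 1e9      # lifetime queries served (assumed)
T = 1_000    # generated tokens per query (assumed)

# Large model, one greedy sample per query.
big = total_flops(70e9, train_tokens=1.4e12,
                  queries=Q, tokens_per_query=T, samples=1)

# Small model, 64 samples per query (best-of-N style).
small = total_flops(5e9, train_tokens=1.4e12,
                    queries=Q, tokens_per_query=T, samples=64)

print(f"big model:   {big:.2e} FLOPs")
print(f"small model: {small:.2e} FLOPs")
```

With these particular numbers the two configurations land near parity; cut the query volume by an order of magnitude and the small, sampling-heavy system becomes several times cheaper, which is the frontier's basic lever.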
Three families of inference-time strategies are studied. Repeated sampling (best-of-N, majority vote) draws $N$ independent completions and selects via a verifier or majority. The accuracy gain follows roughly $1 - (1-p)^N$ for an oracle verifier, capped by coverage, the probability that the correct answer appears in the sample at all. Sequential revision lets the model condition on its previous attempt, which empirically gives better scaling on hard problems because it can pursue a single line of thought further. Tree search with a prm prunes wrong branches early and is the most compute-efficient strategy when a strong process verifier is available; this is the strategy that AlphaProof and o1-style systems inherit from AlphaZero.
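The oracle-verifier curve and its gap to majority voting are easy to simulate. The sketch below assumes a per-sample success probability $p = 0.1$ and 20 distinct wrong answers, both invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Oracle-verifier best-of-N: accuracy = 1 - (1 - p)^N, where p is the
# per-sample probability of a correct completion.
p = 0.1
for n in (1, 4, 16, 64, 256):
    print(f"N={n:4d}  oracle best-of-N accuracy = {1 - (1 - p)**n:.3f}")

# Majority vote is weaker at the same N: the correct answer must be a
# plurality, not merely present. Wrong answers are spread uniformly
# over `wrong_answers` distinct values.
def majority_correct(p, n, wrong_answers=20, trials=10_000):
    hits = 0
    for _ in range(trials):
        draws = rng.choice(wrong_answers + 1, size=n,
                           p=[p] + [(1 - p) / wrong_answers] * wrong_answers)
        values, counts = np.unique(draws, return_counts=True)
        hits += values[counts.argmax()] == 0  # answer 0 is "correct"
    return hits / trials

print(f"majority@64 ≈ {majority_correct(0.1, 64):.3f}")
```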
A practical consequence is that adaptive compute, spending more on hard queries and less on easy ones, is dominant. Snell et al. show that an oracle that knows problem difficulty can match fixed-budget best-of-1024 with an average of best-of-32, a $32\times$ saving. Real systems approximate this with confidence-based early stopping or with a router that allocates a thinking budget based on a cheap classifier.
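A minimal sketch of the confidence-based early-stopping pattern, with a hypothetical `sample_answer` stub standing in for a real model call; the agreement threshold and budget cap are illustrative.

```python
import random
from collections import Counter

def adaptive_best_of_n(sample_answer, n_max=1024, agree=8):
    """Sample until some answer has `agree` votes, or the cap is hit.

    Easy queries, where the model answers consistently, exit after
    roughly `agree` samples; hard queries spend up to the full budget.
    """
    counts = Counter()
    for n in range(1, n_max + 1):
        counts[sample_answer()] += 1
        answer, votes = counts.most_common(1)[0]
        if votes >= agree:
            return answer, n          # early exit on easy queries
    return counts.most_common(1)[0][0], n_max

# Hypothetical sampler for an easy query: 90% of completions agree.
easy = lambda: "42" if random.random() < 0.9 else str(random.randint(0, 9))
print(adaptive_best_of_n(easy))       # typically stops after ~9 samples
```

Real deployments replace the vote counter with a learned confidence signal or a prompt-level difficulty router, but the allocation logic, spend until confident and cap at a budget, is the same.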
Test-time scaling laws explain why a small open model with a strong prm and a search procedure can outperform a much larger base model on math and code, and why frontier vendors now publish reasoning benchmarks under explicit compute budgets ("o3 high", "Claude 4 extended thinking") rather than as single numbers. They also raise an economic question for inference providers: the marginal cost per correct answer becomes a meaningful pricing axis, distinct from cost per token.
Related terms: Chain-of-Thought, o1 / Reasoning Models, OpenAI o3, Claude 4 Family, DeepSeek R1-Zero, Process Reward Model, Inference-Time Scaling
Discussed in:
- Chapter 16: Ethics & Safety, Test-Time Compute