Glossary

Inference-Time Scaling

Inference-time scaling is the umbrella term for any technique that improves an LLM's answer quality by spending more compute per query at deployment time, rather than by training a larger model. It encompasses repeated sampling, search, extended reasoning, and verifier-guided generation, and it is the conceptual frame within which the test-time compute scaling laws of Snell et al. (2024) sit.

The four main families are layered roughly by sophistication.

1. Repeated sampling (best-of-N, majority vote). Draw $N$ independent completions $\{y_1, \dots, y_N\}$ from the policy at non-zero temperature, then select. Best-of-N picks the highest-scoring completion under a verifier or reward model; majority vote (self-consistency, Wang et al. 2022) takes the mode of the answers, which is surprisingly competitive for math problems where the correct answer has a unique normalised form. The accuracy curve typically follows $\mathrm{acc}(N) \approx \mathrm{acc}_\infty - c/\sqrt{N}$ until coverage limits are hit. Compute scales linearly in $N$.

2. Beam search and lookahead. At each generation step, maintain $k$ partial trajectories ranked by cumulative log-probability (or a learned scorer), expand them all, and keep the top-$k$. Beam search is cheap (a constant-factor multiplier on greedy generation) but rarely outperforms greedy on reasoning tasks unless paired with a verifier: the beam tends to converge on locally fluent but globally wrong continuations.

3. Extended chain of thought / reasoning chains. Increase the number of intermediate reasoning tokens before the final answer, both by training (with process supervision or RL on verifiable rewards) and by prompting the model to "think more". Modern reasoning models such as OpenAI o3 and Claude 4 extended thinking spend 10× to 100× more tokens on hard problems than on easy ones, with adaptive budgets. Compute scales linearly in chain length.

4. Tree search guided by a process verifier. The most compute-efficient strategy on hard problems. Maintain a tree of partial solutions, expand the most promising leaves under a process reward model (PRM) score, and prune branches whose minimum PRM falls below a threshold. This is the AlphaZero-style strategy that powers AlphaProof and is a significant component of the o-series. Compute scales with tree size, but the tree concentrates on promising regions.
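
Strategy 1 can be sketched in a few lines. Everything below is a toy stand-in: `sample_answer` simulates drawing a completion at non-zero temperature, and `verifier_score` simulates a reward model; neither corresponds to a real API.

```python
import random
from collections import Counter

def sample_answer(rng):
    # Toy stand-in for one sampled completion at temperature > 0:
    # returns the correct answer "42" with probability 0.4.
    return "42" if rng.random() < 0.4 else rng.choice(["40", "41", "43"])

def verifier_score(answer):
    # Toy stand-in for a verifier / reward model; higher is better.
    return 1.0 if answer == "42" else 0.3

def majority_vote(answers):
    # Self-consistency: take the most common normalised answer.
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers):
    # Best-of-N: keep the completion the verifier scores highest.
    return max(answers, key=verifier_score)

rng = random.Random(0)
answers = [sample_answer(rng) for _ in range(32)]
chosen = best_of_n(answers)
```

Note the division of labour: majority vote needs no verifier but requires answers with a canonical normalised form, while best-of-N handles free-form outputs but inherits the verifier's biases.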
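
Strategy 2 reduces to a small loop over an assumed `step_fn` interface that returns (token, log-probability) continuations for a prefix; the toy model here always prefers token "a", so it is a sketch, not a real decoder.

```python
import heapq
import math

def beam_search(step_fn, k, depth):
    # Keep the k highest cumulative-log-prob partial trajectories per step.
    beams = [(0.0, [])]  # (cumulative log-prob, token prefix)
    for _ in range(depth):
        candidates = [
            (score + lp, prefix + [tok])
            for score, prefix in beams
            for tok, lp in step_fn(prefix)
        ]
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return beams

def toy_step(prefix):
    # Toy "model": slightly prefers "a" over "b" at every position.
    return [("a", math.log(0.6)), ("b", math.log(0.4))]

top_score, top_tokens = beam_search(toy_step, k=2, depth=3)[0]
```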
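
Strategy 4 can be sketched as a best-first search, assuming an `expand` function that yields child partial solutions and a `prm` function that returns a process-reward score in [0, 1]. Both are toys here; a real PRM is a learned model scoring reasoning steps.

```python
import heapq
from itertools import count

def prm_tree_search(expand, prm, root, budget=50, prune_below=0.2):
    # Best-first tree search: always expand the open leaf with the
    # highest PRM score, pruning branches that fall below threshold.
    tie = count()  # tiebreaker so the heap never compares nodes directly
    frontier = [(-prm(root), next(tie), root)]
    best_score, best_node = prm(root), root
    for _ in range(budget):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        for child in expand(node):
            s = prm(child)
            if s < prune_below:
                continue  # prune branches whose PRM drops below threshold
            if s > best_score:
                best_score, best_node = s, child
            heapq.heappush(frontier, (-s, next(tie), child))
    return best_score, best_node

# Toy problem: grow a 4-bit string; the "PRM" rewards prefixes dense in 1s.
def expand(node):
    return [node + "0", node + "1"] if len(node) < 4 else []

def prm(node):
    return node.count("1") / 4.0

score, solution = prm_tree_search(expand, prm, root="")
```

The compute concentration the entry describes falls out of the priority queue: low-scoring subtrees are never expanded, so the budget is spent almost entirely on promising regions.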

Snell et al. (2024) showed that for a fixed compute budget, the optimal strategy depends on problem difficulty. Easy problems are best handled by best-of-N (the right answer is in the top few samples). Hard problems benefit from tree search and sequential revision. An adaptive router that chooses strategy per query can dominate any single fixed strategy, motivating the adaptive thinking budget designs in Claude 4 and o3.
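The routing idea amounts to a small dispatch on estimated difficulty. The thresholds and the difficulty signal below are invented for illustration; in practice difficulty might be estimated from agreement across a handful of cheap samples or a learned predictor.

```python
def choose_strategy(difficulty):
    # Hypothetical per-query router; thresholds are illustrative only.
    # `difficulty` is an estimate in [0, 1], e.g. 1 - agreement rate
    # across a few cheap samples.
    if difficulty < 0.3:
        return "best_of_n"        # easy: right answer is in the top few samples
    if difficulty < 0.7:
        return "extended_cot"     # medium: longer sequential reasoning
    return "prm_tree_search"      # hard: verifier-guided tree search
```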

The economic implication is that inference cost per task becomes a meaningful axis distinct from per-token cost. A model that uses 5× the tokens per hard problem but solves 10× more of them has a lower cost per solved task, and is preferable for hard-problem-heavy workloads. This has reshaped the API pricing landscape: providers now offer "fast" and "thinking" tiers with explicit reasoning budgets, and developers tune the budget per use case.
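The arithmetic behind that comparison is cost per solved task rather than cost per token. The prices and solve rates below are made up purely for illustration, not real provider pricing:

```python
def cost_per_solved_task(tokens_per_task, dollars_per_mtok, solve_rate):
    # Expected dollars spent per task actually solved.
    return tokens_per_task * dollars_per_mtok / 1e6 / solve_rate

# Hypothetical numbers: a "fast" tier vs a "thinking" tier at the same
# per-token price.
fast = cost_per_solved_task(2_000, 10.0, 0.05)       # few tokens, 5% solved
thinking = cost_per_solved_task(10_000, 10.0, 0.50)  # 5x tokens, 10x solved
```

Under these toy numbers the thinking tier spends five times the tokens per attempt yet costs half as much per solved task.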

Inference-time scaling does not replace training-time scaling; the relevant comparison is value per FLOP at the use case in question. For high-throughput consumer chat, training-side scaling wins because it amortises across queries. For low-throughput, hard-problem use cases (research, agentic loops, formal proof), inference-time scaling wins by enormous margins. The frontier strategy is to build models that are good at both and to expose the trade-off as a runtime knob.

Related terms: Test-Time Compute Scaling Laws, Chain-of-Thought, o1 / Reasoning Models, Process Reward Model, Process Supervision, OpenAI o3, Claude 4 Family, AlphaProof Internals
