Glossary

Test-Time Compute Scaling

Test-time compute scaling is the use of additional compute at inference time, rather than at training time, to improve the performance of a fixed model. Standard approaches include:

- Best-of-N sampling: generate N candidate responses and select the best according to an evaluator.
- Self-consistency: generate many chain-of-thought traces and take the most common final answer.
- Tree-of-thoughts: branch over candidate continuations and prune based on intermediate evaluations.
- Iterative refinement: have the model critique and revise its own output multiple times.
- Extended chain-of-thought: have the model generate a very long internal reasoning trace before answering; this is the approach taken by reasoning models.
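The self-consistency approach above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `fake_sample` is a hypothetical stand-in for a model call that would, in a real system, draw a fresh chain-of-thought trace and parse out its final answer.

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_fn, n):
    # Draw n independent answers and return the most common one (majority vote).
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for sampling a reasoning trace from a model.
_canned = cycle(["42", "42", "17", "42", "23"])
def fake_sample():
    return next(_canned)

print(self_consistency(fake_sample, 5))  # → 42
```

Spending more test-time compute here simply means increasing n: more sampled traces make the majority vote more reliable.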

Empirically, performance scales smoothly with test-time compute on many tasks, in a manner analogous to the training-compute scaling laws. Snell et al. (2024) and others have characterised these test-time scaling laws and the conditions under which they hold.
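A toy calculation shows why performance can improve smoothly with test-time compute. Under the simplifying assumptions of i.i.d. samples and an oracle verifier, best-of-N succeeds if any one of the N samples is correct, so the success probability is 1 - (1 - p)^N; this is an illustrative model, not the empirical scaling laws themselves.

```python
def best_of_n_success(p, n):
    # Probability that at least one of n i.i.d. samples is correct,
    # assuming a single sample is correct with probability p and an
    # oracle verifier picks out any correct sample.
    return 1 - (1 - p) ** n

# Even a weak sampler (p = 0.2) improves steadily as n grows.
for n in [1, 4, 16, 64]:
    print(n, round(best_of_n_success(0.2, n), 3))
```

With an imperfect verifier or correlated samples the gains flatten out sooner, which is part of what the test-time scaling-law studies characterise.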

The September 2024 release of OpenAI's o1, which made test-time compute scaling explicit by training the model to use extended reasoning traces, is generally regarded as the moment at which test-time compute became a central axis of frontier AI capability, comparable to training-time scale.

Test-time scaling has substantial implications for the economics of AI deployment: the same model can be made dramatically more capable for high-value queries by spending more inference compute, while routine queries can be served cheaply with minimal compute. This decouples capability from model size in ways that previous paradigms did not.

Related terms: o1 / Reasoning Models, Chain-of-Thought, Scaling Laws

