Accuracy as a function of inference budget for three model strengths.
From the chapter: Chapter 15: Modern AI
Glossary: test-time compute, scaling laws, reasoning models, OpenAI o3
People: Jared Kaplan
References: Snell et al., 2024
Transcript
For a decade, scaling laws meant making models bigger and training them on more data. A new axis emerged in 2024: scaling test-time compute, that is, letting the model think for longer.
The horizontal axis is the log of the number of tokens spent reasoning at inference. The vertical axis is accuracy on a hard benchmark.
Each curve is a different model size. As compute grows, accuracy follows a sigmoid: slow at first, then a steep climb, then a saturation plateau.
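One way to make this shape concrete is a toy logistic fit in log-compute. A minimal sketch, assuming the curve is parameterized by a floor, a ceiling, a midpoint, and a slope (all parameter names here are hypothetical, not from the transcript):

```python
import numpy as np

def accuracy(tokens, floor, ceiling, midpoint, slope):
    """Toy logistic fit: accuracy as a function of inference tokens.

    floor    -- accuracy with essentially no extra reasoning
    ceiling  -- the saturation plateau the curve approaches
    midpoint -- log10(tokens) at the steepest point of the climb
    slope    -- how sharply accuracy rises through the midpoint
    """
    x = np.log10(tokens)
    return floor + (ceiling - floor) / (1.0 + np.exp(-slope * (x - midpoint)))
```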
Notice the gap. A frontier model reaches the steep regime at a lower token budget than a small one. The same compute budget buys more accuracy on a stronger base.
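To see the gap numerically, here is the sketch above evaluated at one shared budget with made-up parameters for a small and a frontier model; the numbers are illustrative, not measured:

```python
budget = 4_000  # one shared token budget (illustrative)

# Hypothetical fits: the frontier model has a higher ceiling and, crucially,
# a lower midpoint, so it enters the steep regime at a smaller budget.
small    = accuracy(budget, floor=0.10, ceiling=0.55, midpoint=4.2, slope=2.5)
frontier = accuracy(budget, floor=0.25, ceiling=0.85, midpoint=3.2, slope=2.5)

print(f"small:    {small:.2f}")     # ~0.18, still on the slow early rise
print(f"frontier: {frontier:.2f}")  # ~0.69, already climbing steeply
```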
Lifting the saturation accuracy raises every curve. This is what reinforcement learning from human feedback and verifier-guided search do: they raise the ceiling that additional thinking can reach.
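In the toy fit this corresponds to increasing the ceiling parameter, which lifts accuracy at every budget, not just at saturation. A quick check, again with hypothetical numbers:

```python
# Same hypothetical frontier fit, before and after raising the ceiling.
for ceiling in (0.70, 0.85):
    curve = [round(accuracy(t, 0.25, ceiling, 3.2, 2.5), 3)
             for t in (100, 1_000, 10_000)]
    print(f"ceiling={ceiling}: {curve}")
# Approximate output:
# ceiling=0.7:  [0.271, 0.420, 0.646]
# ceiling=0.85: [0.278, 0.477, 0.778]
```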
OpenAI o1, DeepSeek R1, and Claude 3.7 demonstrated this empirically. Doubling thinking time can be worth a tenfold increase in model parameters.
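The exchange rate between thinking time and model size depends on where each curve sits. As an illustration only, the toy fit can be inverted to solve for the budget a smaller model would need to match a larger one at a reference point; the parameters remain hypothetical:

```python
# How many tokens would the hypothetical small model need to match the
# frontier model's accuracy at a 1,000-token budget? Invert the logistic.
target = accuracy(1_000, floor=0.25, ceiling=0.85, midpoint=3.2, slope=2.5)
sigma = (target - 0.10) / (0.55 - 0.10)           # small model's logistic value
log10_tokens = 4.2 + np.log(sigma / (1 - sigma)) / 2.5
print(f"target {target:.2f} needs ~{10 ** log10_tokens:,.0f} tokens")
# With these made-up parameters the multiplier is large; the favorable
# trade the transcript cites applies in the steep regime of the curve,
# not near the smaller model's ceiling.
```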