Abstract. An empirical study of the trade-off between training compute and inference compute for language models. The authors compare best-of-$N$ sampling, sequential refinement, and tree search across reasoning benchmarks at a fixed inference budget, and report that for many problem distributions, spending $K\times$ more compute on best-of-$N$ sampling at inference time can match or exceed the performance of training a $K\times$-larger model. The paper provided one of the first quantitative validations that test-time compute is a competitive scaling axis and informed the inference-time-compute paradigm later popularized by o1.
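The best-of-$N$ strategy the abstract refers to can be sketched as follows. This is a toy illustration, not the paper's implementation: `generate` and `score` are hypothetical placeholders standing in for a language model sampler and a verifier or reward model, respectively.

```python
import random

def generate(prompt, seed):
    """Placeholder for drawing one candidate answer from a language model.
    Here it just returns a pseudo-random number so the sketch is runnable."""
    return random.Random((prompt, seed).__hash__()).random()

def score(prompt, candidate):
    """Placeholder for a verifier / reward model; higher is better."""
    return candidate  # in this toy setup the candidate is its own score

def best_of_n(prompt, n):
    """Sample n independent candidates and keep the highest-scoring one.
    Inference cost grows linearly with n, which is the knob the paper
    trades off against training a larger model."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

answer = best_of_n("What is 7 * 8?", n=16)
```

Because the $N$ samples are independent, best-of-$N$ parallelizes trivially, which is part of why it is an attractive way to spend extra inference compute.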
Tags: language-models, reasoning, inference