Abstract. An empirical study of the trade-off between training compute and inference compute for language models. The authors compare best-of-$N$ sampling, sequential refinement, and tree search across reasoning benchmarks at a fixed inference budget, and report that for many problem distributions, spending $K\times$ more compute on best-of-$N$ sampling at inference time can match or exceed the performance of training a $K\times$-larger model. The paper provided one of the first quantitative validations that test-time compute is a competitive scaling axis and informed the inference-time-compute paradigm later popularized by o1.
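The best-of-$N$ strategy the abstract refers to can be sketched as follows. This is a toy illustration, not the paper's implementation: `generate` and `score` are hypothetical placeholders standing in for a language model sampler and a verifier or reward model, respectively.

```python
import random

def generate(prompt, seed):
    """Placeholder for drawing one candidate answer from a language model.
    Here it just returns a pseudo-random number so the sketch is runnable."""
    return random.Random((prompt, seed).__hash__()).random()

def score(prompt, candidate):
    """Placeholder for a verifier / reward model; higher is better."""
    return candidate  # in this toy setup the candidate is its own score

def best_of_n(prompt, n):
    """Sample n independent candidates and keep the highest-scoring one.
    Inference cost grows linearly with n, which is the knob the paper
    trades off against training a larger model."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

answer = best_of_n("What is 7 * 8?", n=16)
```

Because the $N$ samples are independent, best-of-$N$ parallelizes trivially, which is part of why it is an attractive way to spend extra inference compute.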
Tags: language-models, reasoning, inference