Best-of-$N$ sampling generates multiple candidate outputs and selects the best according to some scoring function. Used both as a deployment-time technique and as the basis for rejection sampling in RLHF training.
Algorithm:
- Sample $N$ candidate outputs $y_1, \ldots, y_N$ from the model.
- Score each by a reward / quality function $r$: $r_i = r(x, y_i)$.
- Return $y^* = y_{i^*}$ where $i^* = \arg\max_i r_i$.
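A minimal sketch of this loop, assuming hypothetical `generate` and `reward` callables (neither name comes from the text above):

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical sampler: one candidate per call
    reward: Callable[[str, str], float],  # hypothetical scorer r(x, y)
    n: int = 16,
) -> Tuple[str, float]:
    """Sample n candidates, score each, return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, y) for y in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```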
The scoring function can be:
- A trained reward model (RLHF).
- A separate verifier model (math, code: check execution / proof correctness).
- The model's own log-likelihood (for diverse-but-coherent generation).
- Human ranking (impractical at deployment but used to collect training data).
- A regex / unit test / formal-spec satisfier (for code).
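For the code case, one hedged illustration of a "unit test satisfier" scorer: run a set of tests against the candidate and score by the pass rate. The harness and its interface are assumptions of this sketch, not something fixed by the text.

```python
def unit_test_score(candidate_code: str, tests: list) -> float:
    """Score a code candidate by the fraction of unit tests it passes.

    Each test is assumed to be a callable that takes the candidate's
    namespace and raises AssertionError on failure -- a deliberately
    simple harness for illustration.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # caution: run untrusted code only in a sandbox
    except Exception:
        return 0.0  # code that does not even load scores zero
    passed = 0
    for test in tests:
        try:
            test(namespace)
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0
```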
Empirical benefits: best-of-$N$ improves expected quality roughly as $\sqrt{\log N}$:
$$\mathbb{E}[\max_i r(y_i)] - \mathbb{E}[r(y_1)] \approx \sigma_r \sqrt{2 \log N}$$
for approximately Gaussian $r$, where $\sigma_r$ is the standard deviation of rewards. The returns diminish: each doubling of $N$ adds a smaller increment of expected reward than the last, so most of the benefit comes from the first handful of samples, though quality keeps creeping upward at large $N$.
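A quick Monte Carlo check of the Gaussian approximation (the simulation is illustrative, not from the source; note the $\sqrt{2 \log N}$ rule of thumb is asymptotic and overshoots at small $N$):

```python
import math
import random

def expected_best_of_n_gain(n: int, sigma: float = 1.0, trials: int = 20000) -> float:
    """Estimate E[max of n i.i.d. N(0, sigma^2) rewards] by simulation.

    With mean 0 this equals the gain over a single sample.
    """
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(0.0, sigma) for _ in range(n))
    return total / trials

for n in (2, 8, 32, 128):
    approx = math.sqrt(2 * math.log(n))  # the sqrt(2 log N) rule of thumb
    print(f"N={n:4d}  simulated gain={expected_best_of_n_gain(n):.3f}  approx={approx:.3f}")
```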
Reward hacking at extreme $N$: the chosen $y^*$ is the one that most exploits the reward model. If the reward model is imperfect (which it always is), best-of-$N$ at $N \gg 100$ can produce outputs that score well but are actually bad, selecting for the reward model's blind spots. This is Goodhart's law in action: optimising for a proxy makes it stop being a good proxy.
KL-regularised best-of-$N$ (Khanov, Burapacheep, Li 2024): pick the candidate that maximises $r(y) - \beta D_\mathrm{KL}(y \,\|\, \pi_\mathrm{ref})$, i.e. penalise candidates that drift far from a reference distribution. This mitigates reward hacking.
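A hedged sketch of that selection rule, using a per-sequence log-probability ratio between the sampling policy and the reference model as a stand-in for the KL term; that substitution, and the `policy_logprob` / `ref_logprob` interfaces, are assumptions of this sketch rather than the paper's exact formulation.

```python
from typing import Callable, List

def kl_regularised_best_of_n(
    prompt: str,
    candidates: List[str],
    reward: Callable[[str, str], float],          # r(x, y)
    policy_logprob: Callable[[str, str], float],  # log pi(y | x) under the sampling policy
    ref_logprob: Callable[[str, str], float],     # log pi_ref(y | x) under the reference model
    beta: float = 0.1,
) -> str:
    """Pick the candidate maximising reward minus a drift penalty from the reference model."""
    def penalised_score(y: str) -> float:
        drift = policy_logprob(prompt, y) - ref_logprob(prompt, y)  # proxy for the KL term
        return reward(prompt, y) - beta * drift
    return max(candidates, key=penalised_score)
```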
In RLHF training:
Rejection sampling fine-tuning (RSF): generate $N$ candidates per prompt from the SFT model, keep the one the reward model rates highest, and fine-tune on the selected outputs. Simpler than PPO and often nearly as effective; used by Llama-2 and many subsequent models.
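A sketch of the data-collection step, reusing the same hypothetical `generate` / `reward` interfaces as above; the actual Llama-2 pipeline involves more machinery (iterated rounds, temperature schedules) than this.

```python
def build_rsf_dataset(prompts, generate, reward, n: int = 16):
    """For each prompt, keep the best-of-n completion; the result is an SFT dataset."""
    dataset = []
    for x in prompts:
        candidates = [generate(x) for _ in range(n)]
        best = max(candidates, key=lambda y: reward(x, y))
        dataset.append({"prompt": x, "completion": best})
    return dataset  # fine-tune the SFT model on these (prompt, completion) pairs
```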
At deployment:
For high-stakes applications (medical AI, legal, code generation), best-of-$N$ with a reliable verifier provides quality improvements at the cost of $N \times$ inference compute. With LLM inference now relatively cheap, $N = 8$ to $32$ is increasingly common for high-stakes deployments.
Comparison to self-consistency:
- Self-consistency is best-of-$N$ where the "scoring function" is majority vote over extracted answers.
- More general best-of-$N$ uses an external scoring function (reward model, verifier, etc.).
- Both share the same compute-trade-off character: more samples = better quality at higher cost.
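To make the comparison concrete, here is self-consistency expressed in the same best-of-$N$ shape, with a hypothetical `extract_answer` parser standing in for answer extraction:

```python
from collections import Counter

def self_consistency(prompt, generate, extract_answer, n: int = 16):
    """Best-of-N where the 'score' is the vote count of each candidate's extracted answer."""
    candidates = [generate(prompt) for _ in range(n)]
    votes = Counter(extract_answer(y) for y in candidates)
    majority_answer, _ = votes.most_common(1)[0]
    # return one candidate whose extracted answer matches the majority vote
    return next(y for y in candidates if extract_answer(y) == majority_answer)
```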
Reasoning models (o1, DeepSeek-R1) effectively internalise something analogous to best-of-$N$ during training: the RL objective rewards single-sample behaviour that matches what best-of-$N$ ensembling would produce.
Related terms: RLHF, Self-Consistency, Test-Time Compute Scaling
Discussed in:
- Chapter 15: Modern AI