Best-of-$N$ sampling generates multiple candidate outputs and selects the best according to some scoring function. Used both as a deployment-time technique and as the basis for rejection sampling in RLHF training.
Algorithm:
- Sample $N$ candidate outputs $y_1, \ldots, y_N$ from the model.
- Score each by a reward / quality function $r$: $r_i = r(x, y_i)$.
- Return $y^* = y_{i^*}$ where $i^* = \arg\max_i r_i$.
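A minimal sketch of this loop, assuming hypothetical `generate` and `reward` callables (neither name comes from the text above):

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical sampler: one candidate per call
    reward: Callable[[str, str], float],  # hypothetical scorer r(x, y)
    n: int = 16,
) -> Tuple[str, float]:
    """Sample n candidates, score each, return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, y) for y in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```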
The scoring function can be:
- A trained reward model (RLHF).
- A separate verifier model (math, code: check execution / proof correctness).
- The model's own log-likelihood (for diverse-but-coherent generation).
- Human ranking (impractical at deployment but used to collect training data).
- A regex / unit test / formal-spec satisfier (for code).
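For the code case, one hedged illustration of a "unit test satisfier" scorer: run a set of tests against the candidate and score by the pass rate. The harness and its interface are assumptions of this sketch, not something fixed by the text.

```python
def unit_test_score(candidate_code: str, tests: list) -> float:
    """Score a code candidate by the fraction of unit tests it passes.

    Each test is assumed to be a callable that takes the candidate's
    namespace and raises AssertionError on failure -- a deliberately
    simple harness for illustration.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # caution: run untrusted code only in a sandbox
    except Exception:
        return 0.0  # code that does not even load scores zero
    passed = 0
    for test in tests:
        try:
            test(namespace)
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0
```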
Empirical benefits: best-of-$N$ improves expected quality roughly as $\sqrt{\log N}$:
$$\mathbb{E}[\max_i r(y_i)] - \mathbb{E}[r(y_1)] \approx \sigma_r \sqrt{2 \log N}$$
for approximately Gaussian $r$, where $\sigma_r$ is the standard deviation of rewards. The returns diminish: each doubling of $N$ adds a smaller increment of expected reward than the last, so most of the benefit comes from the first handful of samples, though quality keeps creeping upward at large $N$.
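A quick Monte Carlo check of the Gaussian approximation (the simulation is illustrative, not from the source; note the $\sqrt{2 \log N}$ rule of thumb is asymptotic and overshoots at small $N$):

```python
import math
import random

def expected_best_of_n_gain(n: int, sigma: float = 1.0, trials: int = 20000) -> float:
    """Estimate E[max of n i.i.d. N(0, sigma^2) rewards] by simulation.

    With mean 0 this equals the gain over a single sample.
    """
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(0.0, sigma) for _ in range(n))
    return total / trials

for n in (2, 8, 32, 128):
    approx = math.sqrt(2 * math.log(n))  # the sqrt(2 log N) rule of thumb
    print(f"N={n:4d}  simulated gain={expected_best_of_n_gain(n):.3f}  approx={approx:.3f}")
```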
Reward hacking at extreme $N$: the chosen $y^*$ is the one that most exploits the reward model. If the reward model is imperfect (which it always is), best-of-$N$ at $N \gg 100$ can produce outputs that score well but are actually bad, selecting for the reward model's blind spots. This is Goodhart's law in action: optimising for a proxy makes it stop being a good proxy.
KL-regularised best-of-$N$ (Khanov, Burapacheep, Li 2024): pick the candidate that maximises $r(y) - \beta D_\mathrm{KL}(y \,\|\, \pi_\mathrm{ref})$, i.e. penalise candidates that drift far from a reference distribution. This mitigates reward hacking.
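A hedged sketch of that selection rule, using a per-sequence log-probability ratio between the sampling policy and the reference model as a stand-in for the KL term; that substitution, and the `policy_logprob` / `ref_logprob` interfaces, are assumptions of this sketch rather than the paper's exact formulation.

```python
from typing import Callable, List

def kl_regularised_best_of_n(
    prompt: str,
    candidates: List[str],
    reward: Callable[[str, str], float],          # r(x, y)
    policy_logprob: Callable[[str, str], float],  # log pi(y | x) under the sampling policy
    ref_logprob: Callable[[str, str], float],     # log pi_ref(y | x) under the reference model
    beta: float = 0.1,
) -> str:
    """Pick the candidate maximising reward minus a drift penalty from the reference model."""
    def penalised_score(y: str) -> float:
        drift = policy_logprob(prompt, y) - ref_logprob(prompt, y)  # proxy for the KL term
        return reward(prompt, y) - beta * drift
    return max(candidates, key=penalised_score)
```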
In RLHF training:
Rejection sampling fine-tuning (RSF): generate $N$ candidates per prompt from the SFT model, keep the one the reward model rates highest, and fine-tune on the selected outputs. Simpler than PPO and often nearly as effective; used by Llama-2 and many subsequent models.
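A sketch of the data-collection step, reusing the same hypothetical `generate` / `reward` interfaces as above; the actual Llama-2 pipeline involves more machinery (iterated rounds, temperature schedules) than this.

```python
def build_rsf_dataset(prompts, generate, reward, n: int = 16):
    """For each prompt, keep the best-of-n completion; the result is an SFT dataset."""
    dataset = []
    for x in prompts:
        candidates = [generate(x) for _ in range(n)]
        best = max(candidates, key=lambda y: reward(x, y))
        dataset.append({"prompt": x, "completion": best})
    return dataset  # fine-tune the SFT model on these (prompt, completion) pairs
```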
At deployment:
For high-stakes applications (medical AI, legal, code generation), best-of-$N$ with a reliable verifier provides quality improvements at the cost of $N \times$ inference compute. With LLM inference now relatively cheap, $N = 8$ to $32$ is increasingly common for high-stakes deployments.
Comparison to self-consistency:
- Self-consistency is best-of-$N$ where the "scoring function" is majority vote over extracted answers.
- More general best-of-$N$ uses an external scoring function (reward model, verifier, etc.).
- Both share the same compute-trade-off character: more samples = better quality at higher cost.
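To make the comparison concrete, here is self-consistency expressed in the same best-of-$N$ shape, with a hypothetical `extract_answer` parser standing in for answer extraction:

```python
from collections import Counter

def self_consistency(prompt, generate, extract_answer, n: int = 16):
    """Best-of-N where the 'score' is the vote count of each candidate's extracted answer."""
    candidates = [generate(prompt) for _ in range(n)]
    votes = Counter(extract_answer(y) for y in candidates)
    majority_answer, _ = votes.most_common(1)[0]
    # return one candidate whose extracted answer matches the majority vote
    return next(y for y in candidates if extract_answer(y) == majority_answer)
```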
Reasoning models (o1, DeepSeek-R1) effectively internalise something analogous to best-of-$N$ during training: the RL objective rewards single-sample behaviour that matches what best-of-$N$ ensembling would produce.
Related terms: RLHF, Self-Consistency, Test-Time Compute Scaling
Discussed in:
- Chapter 15: Modern AI