Glossary

Self-Play on Verifiable Rewards

Self-play on verifiable rewards is the reinforcement-learning paradigm, central to OpenAI's o3, DeepSeek's R1-Zero and DeepMind's AlphaProof, in which an LLM trains itself by generating solution attempts on tasks whose correctness is machine-verifiable, then updating against the verifier's verdict. The training loop has no human raters and no learned reward model in the inner loop: only a deterministic checker.

The generic loop is: (1) sample a problem $x$ from the curriculum; (2) sample a trajectory $\tau \sim \pi_\theta(\cdot \mid x)$ from the current policy; (3) evaluate $r(x, \tau) = \mathrm{verify}(x, \tau) \in \{0, 1\}$ using a deterministic verifier; (4) update $\theta$ via a policy-gradient method such as PPO or GRPO using $r$ as the reward. The "self-play" framing emphasises that the policy is improving against its own past distribution, with no external teacher beyond the verifier.
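A minimal sketch of one iteration of this loop in Python. The interface here is hypothetical (`policy.rollout`, `sample_problem`, `verify` are placeholder names, not from any specific system), and plain REINFORCE with a group-mean baseline stands in for the PPO/GRPO updates the production systems use:

```python
import torch

def rlvr_step(policy, optimizer, sample_problem, verify, group_size=8):
    """One step of self-play on verifiable rewards (illustrative sketch only).

    Assumed interface:
      policy.rollout(x) -> (text, logprob)  sampled trajectory and its summed log-probability (a torch scalar)
      verify(x, text)   -> bool             deterministic checker
    """
    x = sample_problem()                                              # (1) draw a task from the curriculum

    trajectories = [policy.rollout(x) for _ in range(group_size)]     # (2) sample from the current policy
    rewards = [float(verify(x, text)) for text, _ in trajectories]    # (3) binary verifier verdicts

    baseline = sum(rewards) / len(rewards)                            # group-mean baseline (GRPO-flavoured)
    loss = -sum((r - baseline) * logprob
                for (_, logprob), r in zip(trajectories, rewards)) / group_size

    optimizer.zero_grad()
    loss.backward()                                                   # (4) policy-gradient update of theta
    optimizer.step()
```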

The paradigm is clean in three ways that human-preference RL is not. First, the reward is deterministic and effectively unlimited: every additional rollout is a fresh, fully supervised example, limited only by compute. Second, there is no reward-hacking of the verifier itself: a Lean kernel that accepts a proof has accepted that proof, full stop. Third, the optimisation has a fixed point: once the policy saturates the verifier there is nothing more to learn, unlike RLHF, where reward models drift. This is why DeepSeek's R1-Zero could be trained from a base model with no SFT and no human feedback at all, using only rule-based rewards on math and code.
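To make the "full stop" concrete: in the formal-proof setting a trajectory is a candidate proof term, and the reward is simply whether the kernel type-checks it. A minimal Lean 4 example of the kind of object being verified (illustrative only, not taken from any of the systems above):

```lean
-- A candidate "trajectory" in the formal-maths setting: a proof term.
-- The reward is 1 iff the Lean kernel accepts it, 0 otherwise.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```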

The class of problems where this works is exactly the class with verifiable rewards: math problems with a numerical answer, code problems with a passing test suite, formal proofs that a kernel accepts, regex matches, executable plan traces. The closer the task is to program synthesis or formal mathematics, the cleaner the reward signal. Conversely, on open-ended writing, summarisation, or dialogue, the verifier collapses to either a brittle heuristic or a learned reward model, and we are back to RLHF.
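Two toy verifiers of this kind, sketched for illustration (the exact checkers used by the systems above are not public): a rule-based numeric check for math answers and an execution-based check for generated code.

```python
import re
import subprocess
import tempfile

def verify_math(reference_answer: str, model_output: str) -> bool:
    """Rule-based check: does the last number in the model's output match the reference answer?"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return bool(numbers) and numbers[-1] == reference_answer

def verify_code(solution_src: str, test_src: str, timeout_s: int = 10) -> bool:
    """Execution-based check: run the candidate solution against a test suite in a
    subprocess; reward 1 iff every assertion passes before the timeout.
    (A real system would use a proper sandbox, not a bare subprocess.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_src + "\n" + test_src)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```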

AlphaProof is the highest-compute instance of the paradigm: the Lean kernel acts as the verifier, the policy generates tactic sequences, and AlphaZero-style search with a learned value function explores the proof tree. AlphaGeometry 2 follows the same template with a symbolic engine as the verifier. OpenAI o3 and DeepSeek R1-Zero generalise the recipe to natural-language math and code, with a mix of rule-based verifiers and lightweight execution sandboxes.

The paradigm has reframed the alignment landscape: where RLHF was the dominant post-training method circa 2022–2023, the post-2024 frontier is built primarily on verifiable-reward self-play in domains where it applies, falling back to RLHF and Constitutional AI only where verification is infeasible. Whether this paradigm extends beyond mathematics, code, and games into open-ended reasoning is one of the central open questions of frontier AI.

Related terms: Verifiable Rewards, RLHF, RLAIF, PPO, Group Relative Policy Optimization, DeepSeek R1-Zero, AlphaProof Internals, OpenAI o3
