Glossary

Verifiable Rewards

Verifiable rewards are reward signals derived from deterministic, machine-checkable correctness criteria rather than from human preferences or learned reward models. They are the cleanest possible reward source for reinforcement learning, and they underpin the post-2024 reasoning-model paradigm exemplified by OpenAI o3, DeepSeek R1-Zero, AlphaProof, and AlphaGeometry 2.

The defining property of a verifiable reward is that there exists a function $\mathrm{verify}: \mathcal{X} \times \mathcal{Y} \to \{0, 1\}$ which, given a problem $x$ and a candidate answer $y$, returns 1 if and only if $y$ is correct, with no judgement, learning, or human in the loop. Canonical examples:

  • Mathematics with numerical answers: extract the boxed answer, normalise it, and compare to the reference (a minimal code sketch follows this list). This is the GSM8K / MATH / AIME setting.
  • Code execution: run the candidate program against a hidden unit-test suite; pass-or-fail is the reward. APPS, HumanEval, LiveCodeBench, and Codeforces all use this.
  • Formal proofs: submit the proof to the Lean / Coq / Isabelle kernel, which accepts or rejects it. AlphaProof's reward signal is Lean acceptance, full stop.
  • Regex / string match: the answer must satisfy a structural predicate.
  • Game outcomes: win, loss, or draw in chess, Go, and other board games, as in AlphaZero and its successor MuZero (Schrittwieser et al.).
  • Executable plans: the agent actually performs the task in a sandboxed environment and a checker validates the end state.
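
To make the first item concrete, here is a minimal Python sketch of a $\mathrm{verify}$ function for the numerical-answer setting. The reference answer passed in here stands in for the ground truth that the formal definition folds into $x$, and the regex and normalisation rules are deliberate simplifications: production graders also handle fractions, units, LaTeX variants, and symbolic equivalence.

```python
import re

def extract_boxed(response: str) -> str | None:
    """Pull the contents of the last \\boxed{...} in a model response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def normalise(answer: str) -> str:
    """Crude normalisation: strip whitespace, commas, and currency signs,
    and canonicalise numbers so that '42' and '42.0' compare equal."""
    answer = answer.strip().replace(",", "").replace("$", "")
    try:
        return repr(float(answer))   # '42' and '42.0' both become '42.0'
    except ValueError:
        return answer                # non-numeric answers compared verbatim

def verify(problem: str, candidate: str, reference: str) -> int:
    """Deterministic 0/1 reward: no judge, no learned model, no human."""
    extracted = extract_boxed(candidate)
    if extracted is None:
        return 0                     # unparseable output earns zero reward
    return int(normalise(extracted) == normalise(reference))

# Example: a rollout that ends with '... the answer is \boxed{42}.'
assert verify("What is 6 x 7?", r"the answer is \boxed{42}.", "42") == 1
```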

What unifies these is the absence of a learned reward model in the inner training loop. Compare with RLHF, where the reward $r_\phi(x, y)$ is itself a neural network trained on human preference comparisons: it can be hacked, it drifts as the policy distribution changes, and it requires constant relabelling. Verifiable rewards have none of these failure modes; they are the ground truth.
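
In objective form the contrast is direct. A standard RLHF formulation maximises the learned reward under a KL penalty toward a reference policy, while the verifiable-reward objective optimises the checker's output itself (in practice a KL term is often kept for stability):

$$\max_{\pi}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big) \qquad \text{(RLHF)}$$

$$\max_{\pi}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[\mathrm{verify}(x, y)\big] \qquad \text{(verifiable rewards)}$$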

The paradigm has three practical consequences:

  • Compute, not labels, becomes the bottleneck. Every rollout is a fresh, perfectly labelled training example, so RL throughput scales with GPU hours rather than human-rater hours.
  • Reward hacking collapses to verifier hacking, which is much narrower: the only way to beat a Lean kernel is to actually produce a valid proof.
  • Curriculum design becomes the central problem. The verifier handles correctness, but a difficulty distribution must be assembled so that the policy always has problems near the edge of its capability: too easy and the gradient vanishes; too hard and it never reaches a positive reward. AlphaProof's 100M synthetic problem set and DeepSeek's progressively harder math curriculum are responses to this.
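
As a sketch of how that frontier can be maintained in practice, the snippet below filters a problem pool by empirical pass rate, assuming the trainer records, for each problem, the fraction of recent rollouts the verifier accepted. The thresholds, the default rate for unseen problems, and the batch size are all illustrative choices, not taken from any published recipe.

```python
import random

def sample_frontier_batch(problems: list[str],
                          pass_rates: dict[str, float],
                          low: float = 0.1,
                          high: float = 0.9,
                          batch_size: int = 64) -> list[str]:
    """Sample training problems near the edge of the policy's capability.

    pass_rates[p] is the fraction of recent rollouts on problem p that the
    verifier accepted. Problems the policy always solves contribute no
    gradient; problems it never solves yield no positive reward. Both are
    filtered out so the batch concentrates on the learnable frontier.
    """
    # Unseen problems default to 0.5: assume they sit on the frontier
    # until measured.
    frontier = [p for p in problems if low <= pass_rates.get(p, 0.5) <= high]
    if not frontier:                  # fall back if the frontier is empty
        frontier = problems
    return random.sample(frontier, min(batch_size, len(frontier)))
```

With binary verifiable rewards this filter is particularly natural, because the pass rate is itself the mean reward and comes for free from the training rollouts.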

Verifiable rewards also reshape evaluation. A model trained with verifiable rewards on math, code, and formal proofs typically gains much of its general reasoning ability from those domains alone: the capability transfers to natural-language tasks that were never part of the training signal. The paradigm thus offers a route to general reasoning improvement that does not require open-ended human judgement.

The hard limit is the domain restriction: verifiable rewards do not exist for "write a beautiful poem", "summarise this conversation tactfully", or "is this medical advice safe". For those domains RLHF, RLAIF, and Constitutional AI remain necessary. The frontier strategy is to use verifiable rewards wherever they apply and to fall back to learned rewards only where one must.

Related terms: Self-Play on Verifiable Rewards, Process Supervision, RLHF, RLAIF, AlphaProof Internals, AlphaGeometry 2, DeepSeek R1-Zero, OpenAI o3
