Glossary

Outcome Reward Model

An Outcome Reward Model (ORM) is a learned scoring function $r_\mathrm{ORM}(x, \tau) \in [0, 1]$ that judges only the terminal answer of a reasoning trajectory $\tau = (s_1, \dots, s_T)$ produced for a problem $x$. It contrasts with a Process Reward Model (PRM), which scores each intermediate step. The distinction was sharpened by Lightman et al. (2023, Let's Verify Step by Step), who used an ORM as the baseline against which process supervision was measured.

Operationally, an ORM is trained as a binary classifier: given a problem and a complete generated solution, predict whether the final answer matches a reference. Training data is cheap to collect: for math, one needs only (problem, solution, ground-truth answer) triples, and the label is simply whether the boxed answer matches. The model is typically a transformer initialised from the base policy with a scalar reward head, fine-tuned with cross-entropy:

$$\mathcal{L}_\mathrm{ORM} = -\mathbb{E}_{(x,\tau,y)}\left[ y \log r_\mathrm{ORM}(x, \tau) + (1-y) \log(1 - r_\mathrm{ORM}(x, \tau)) \right].$$
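
A minimal PyTorch sketch of this training step, assuming a small `nn.TransformerEncoder` stands in for the base-policy backbone and `BCEWithLogitsLoss` implements the cross-entropy above; the architecture and hyperparameters are illustrative, not the setup of any particular paper.

```python
import torch
import torch.nn as nn

class OutcomeRewardModel(nn.Module):
    """Toy ORM: transformer encoder plus a scalar reward head.

    In practice the encoder would be initialised from the base policy's
    pretrained weights; a small nn.TransformerEncoder is used here so the
    example is self-contained.
    """

    def __init__(self, vocab_size=32000, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reward_head = nn.Linear(d_model, 1)  # scalar head, as in the text

    def forward(self, input_ids):
        h = self.encoder(self.embed(input_ids))           # (batch, seq, d_model)
        last_token = h[:, -1, :]                          # score the whole trajectory at its final token
        return self.reward_head(last_token).squeeze(-1)   # raw logit; sigmoid gives r_ORM in [0, 1]

# One cross-entropy training step on (tokenised problem+solution, outcome label) pairs.
model = OutcomeRewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.BCEWithLogitsLoss()  # corresponds to the L_ORM objective above

input_ids = torch.randint(0, 32000, (8, 128))   # dummy batch of tokenised (x, tau)
labels = torch.randint(0, 2, (8,)).float()      # y = 1 iff the final answer matches the reference

optimizer.zero_grad()
loss = loss_fn(model(input_ids), labels)
loss.backward()
optimizer.step()
```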

The ORM is then used in two main ways. As a best-of-N selector it ranks $N$ sampled solutions and returns the highest-scoring one, providing an inference-time accuracy boost without retraining. As a reward signal for RL it provides a final-step reward that propagates back via PPO or GRPO; because the reward is sparse, the policy gradient has high variance and credit assignment to early reasoning steps is weak.
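
The best-of-N use is simple enough to sketch directly; here `sample_solution` and `orm_score` are hypothetical callables standing in for the policy sampler and the trained ORM.

```python
def best_of_n(problem, sample_solution, orm_score, n=16):
    """Rank N sampled solutions with the ORM and return the top one.

    sample_solution(problem) -> str        # hypothetical policy sampler
    orm_score(problem, solution) -> float  # hypothetical ORM score in [0, 1]
    """
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: orm_score(problem, sol))
```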

The empirical comparison with process supervision is consistent: PRMs beat ORMs at large $N$ and on hard problems, while ORMs are competitive on easy problems and are far cheaper to train. Lightman et al. reported that on MATH at $N=1860$, the PRM solved 78.2% of problems versus the ORM's 72.4%; on GSM8K the gap is much smaller because the chains are shorter and final-answer correctness correlates strongly with chain quality.

ORMs remain the workhorse for many production setups because outcome labels are essentially free in domains with verifiable rewards: code passes or fails its tests, a math answer matches or does not, a Lean proof type-checks or does not. In these settings the ORM degenerates into a deterministic verifier, and the practical question becomes whether to invest the additional labelling effort to build a process verifier as well.
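
A sketch of such a degenerate, deterministic verifier for boxed math answers; the regex-and-string-match check is illustrative, and real graders normalise answers (fractions, units, LaTeX spacing) before comparing.

```python
import re

def outcome_reward(solution_text: str, reference_answer: str) -> float:
    """Deterministic 'ORM' for verifiable math: 1.0 iff the last \\boxed{...}
    answer matches the reference exactly, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", solution_text)
    if not matches:
        return 0.0
    predicted = matches[-1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# The reward depends only on the final answer, not on the chain that produced it.
print(outcome_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
```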

A subtler point is that ORM-trained policies tend to develop reward-hacking behaviours that look correct at the answer level but contain errors mid-chain, e.g. dropping a sign and silently restoring it later, or making two errors that cancel. A PRM catches these; an ORM does not. This is one reason post-2024 reasoning systems lean towards PRMs or hybrid PRM/ORM verifiers.

Related terms: Process Reward Model, Process Supervision, RLHF, PPO, Verifiable Rewards, Chain-of-Thought
