Process supervision is the training paradigm, introduced by Lightman, Kosaraju, Burda, Edwards, Baker, Lee, Leike, Schulman, Sutskever and Cobbe (OpenAI, 2023) in Let's Verify Step by Step, in which a reward model evaluates each step of a reasoning chain rather than only the final answer. It contrasts with outcome supervision, where only the terminal answer receives a reward signal. PRM800K, the dataset released alongside the paper, comprises 800,000 step-level human labels on MATH solutions and became the canonical benchmark for training process reward models (PRMs).
The motivation is statistical and pedagogical. With outcome supervision, a solution whose reasoning is largely correct but whose final answer is wrong earns no credit, while a solution that reaches the correct answer through flawed reasoning or lucky cancellation of errors is fully rewarded. Credit assignment across a long chain of thought is therefore noisy. Process supervision provides a much denser learning signal: in a 20-step proof, each step is its own training example. Lightman et al. showed that a PRM trained on PRM800K and used to rank best-of-N samples solved 78% of MATH problems at $N=1860$, versus 72% for an outcome reward model, and the gap widens as $N$ grows because the PRM is much better at ruling out wrong but plausible-looking solutions.
Formally, given a reasoning trajectory $\tau = (s_1, s_2, \dots, s_T)$ produced for a problem $x$, an outcome reward model (ORM) defines a single reward $r_\mathrm{ORM}(x, \tau) \in \{0, 1\}$, while a PRM defines a per-step reward
$$r_\mathrm{PRM}(x, s_{1:t}) \in [0, 1] \quad \text{for } t = 1, \dots, T,$$
typically interpreted as the probability that the partial solution can still be completed correctly. Composite trajectory scores combine the per-step rewards via either a minimum ("any wrong step ruins the solution") or a product over steps; both aggregations outperform ORM scores for ranking.
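As a concrete illustration, here is a minimal sketch of how per-step PRM scores might be aggregated for best-of-N ranking; the function names and the candidate format are illustrative rather than any particular library's API.

```python
import math

def aggregate_prm_scores(step_scores, method="prod"):
    """Collapse per-step PRM scores r_PRM(x, s_{1:t}) into one trajectory score.

    'prod' multiplies step probabilities (the full-solution score in
    Lightman et al.); 'min' treats any single weak step as fatal.
    """
    if method == "prod":
        return math.prod(step_scores)
    if method == "min":
        return min(step_scores)
    raise ValueError(f"unknown aggregation method: {method}")

def best_of_n(candidates, method="prod"):
    """Pick the best of N sampled solutions by aggregated PRM score.

    `candidates` is a list of (solution_text, [step_score, ...]) pairs;
    the format is illustrative, not a fixed API.
    """
    return max(candidates, key=lambda c: aggregate_prm_scores(c[1], method))
```

The minimum is the more conservative choice: a single low-confidence step caps the score of the whole trajectory, whereas the product lets many strong steps partially offset one weak one.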
Process labels can be human-annotated, which is expensive but high quality, or AI-generated. Wang et al. (2023) showed that a strong base model can label its own steps via Monte Carlo rollouts: for each prefix $s_{1:t}$, sample $K$ completions and score the prefix by the fraction that reach the correct final answer. This produces noisy but cheap process labels and has become the standard recipe in the open-source ecosystem (e.g. Math-Shepherd, OmegaPRM).
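A sketch of that Monte Carlo labelling recipe follows, assuming hypothetical `sample_completion` and `is_correct` hooks into the generator and the answer checker.

```python
def mc_step_label(problem, prefix_steps, sample_completion, is_correct, k=8):
    """Estimate a soft process label for a solution prefix via Monte Carlo rollouts.

    For the prefix s_{1:t}, sample K completions from the policy and return the
    fraction that reach the correct final answer (0 = looks unrecoverable,
    1 = every rollout succeeded). `sample_completion` and `is_correct` are
    assumed hooks into the generator and the answer checker.
    """
    hits = 0
    for _ in range(k):
        completion = sample_completion(problem, prefix_steps)  # continue the chain
        hits += bool(is_correct(problem, completion))          # count successful rollouts
    return hits / k

def label_trajectory(problem, steps, sample_completion, is_correct, k=8):
    """Return one (step_index, soft_label) pair per prefix of the trajectory."""
    return [
        (t, mc_step_label(problem, steps[:t], sample_completion, is_correct, k))
        for t in range(1, len(steps) + 1)
    ]
```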
Process supervision plugs into reinforcement learning as a denser reward signal for PPO or GRPO, where each token (or step) advantage is shaped by the PRM rather than by the sparse final reward. This is partly why post-2024 reasoning models train so efficiently: the verifier provides feedback throughout the chain rather than only at the answer.
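One simple way to turn PRM scores into such a dense signal is to credit each step with the change in the PRM's estimate, optionally adding any verifiable outcome reward at the final step. The sketch below assumes a `prm_score(problem, prefix)` callable and is an illustrative shaping scheme, not the recipe of any specific system.

```python
def shaped_step_rewards(problem, steps, prm_score, final_reward=None):
    """Turn PRM prefix scores into a dense per-step reward for RL.

    Each step is credited with the change in the PRM's estimate that the
    solution can still be completed correctly (a simple potential-style
    shaping); the empty-prefix score serves as the baseline. If a verifiable
    outcome reward is available, it is added at the final step.
    `prm_score(problem, prefix)` is an assumed interface.
    """
    rewards = []
    prev = prm_score(problem, [])  # PRM estimate before any step is taken
    for t in range(1, len(steps) + 1):
        cur = prm_score(problem, steps[:t])
        rewards.append(cur - prev)  # credit (or blame) attributed to step t
        prev = cur
    if final_reward is not None and rewards:
        rewards[-1] += final_reward  # keep the sparse outcome signal too
    return rewards
```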
The technique generalises beyond mathematics. Code RL uses unit tests as a natural process signal (each test is a step-level reward); formal proof RL uses Lean kernel acceptance per tactic, which is exactly what AlphaProof does at scale. The Lightman paper is the canonical citation that crystallised the principle for natural-language reasoning.
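For the code case, the step-level signal can be as simple as the fraction of unit tests a candidate program passes; a rough sketch, with an illustrative test-case format (real pipelines execute candidates in a sandbox rather than in-process):

```python
def unit_test_reward(candidate_fn, test_cases):
    """Score a generated function by the fraction of unit tests it passes.

    Each (args, expected) pair acts as one step-level check; exceptions
    count as failures.
    """
    passed = 0
    for args, expected in test_cases:
        try:
            passed += candidate_fn(*args) == expected
        except Exception:
            pass
    return passed / len(test_cases)
```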
Related terms: Process Reward Model, Outcome Reward Model, Chain-of-Thought, o1 / Reasoning Models, RLHF, Verifiable Rewards, OpenAI o3
Discussed in:
- Chapter 16: Ethics & Safety, Process vs Outcome Supervision