Glossary

Process Supervision

Process supervision is the training paradigm, introduced by Lightman, Kosaraju, Burda, Edwards, Baker, Lee, Leike, Schulman, Sutskever and Cobbe (OpenAI, 2023) in Let's Verify Step by Step, in which a reward model evaluates each step of a reasoning chain rather than only the final answer. It contrasts with outcome supervision, where only the terminal answer receives a reward signal. The PRM800K dataset released alongside the paper, some 800,000 step-level human labels on MATH solutions, became the canonical benchmark for training process reward models (PRMs).

The motivation is statistical and pedagogical. With outcome supervision, a solution whose reasoning is largely correct earns no reward if the final answer is wrong, while a solution that reaches the right answer through errors that happen to cancel earns full reward. Credit assignment across a long chain of thought is therefore noisy. Process supervision provides a much denser learning signal: in a 20-step proof, each step is its own training example. Lightman et al. showed that a PRM trained on PRM800K and used to rank best-of-N samples solved 78% of MATH problems at $N = 1860$, versus 72% for an outcome reward model, and the gap widens as $N$ grows because the PRM is much better at ruling out wrong but plausible-looking solutions.

Formally, given a reasoning trajectory $\tau = (s_1, s_2, \dots, s_T)$ produced for a problem $x$, an ORM defines a single reward $r_\mathrm{ORM}(x, \tau) \in \{0, 1\}$, while a PRM defines a per-step reward

$$r_\mathrm{PRM}(x, s_{1:t}) \in [0, 1] \quad \text{for } t = 1, \dots, T,$$

typically interpreted as the probability that the partial solution can still be completed correctly. To rank whole solutions, the per-step scores are aggregated into a single trajectory score, usually by taking either the minimum over steps ("any wrong step ruins the solution") or the product; both aggregations outperform ORM scores for best-of-N ranking.
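
A minimal sketch of the two aggregation rules and how they drive best-of-N ranking; the helper names and the example numbers are illustrative, not taken from the paper:

```python
import math

def prm_score(step_probs: list[float], aggregate: str = "min") -> float:
    """Collapse per-step PRM probabilities into one trajectory score.

    step_probs[t] is r_PRM(x, s_{1:t+1}) in [0, 1] for each step of the solution.
    """
    if aggregate == "min":
        return min(step_probs)        # any weak step drags the whole score down
    if aggregate == "prod":
        return math.prod(step_probs)  # roughly: probability that every step is sound
    raise ValueError(f"unknown aggregate: {aggregate}")

def best_of_n(candidates: list[list[float]]) -> int:
    """Return the index of the highest-scoring trajectory among N samples."""
    scores = [prm_score(p, aggregate="min") for p in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Three sampled solutions with per-step PRM scores; the second has one shaky step.
samples = [[0.9, 0.8, 0.95], [0.99, 0.4, 0.99], [0.85, 0.9, 0.9]]
print(best_of_n(samples))  # -> 2
```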

Process labels can be human-annotated, which is expensive but high quality, or AI-generated. Wang et al. (2023) showed that a strong base model can label its own steps via Monte Carlo rollouts: for each prefix $s_{1:t}$, sample $K$ completions and score the prefix by the fraction that reach the correct final answer. This produces noisy but cheap process labels and has become the standard recipe in the open-source ecosystem (e.g. Math-Shepherd, OmegaPRM).
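
A sketch of that Monte Carlo labelling recipe; `sample_completion` and `extract_answer` are hypothetical stand-ins for the sampling and answer-parsing machinery a real pipeline would supply:

```python
def mc_step_label(problem: str, prefix_steps: list[str], gold_answer: str,
                  sample_completion, extract_answer, K: int = 8) -> float:
    """Math-Shepherd-style soft label: fraction of K rollouts from this prefix
    that end at the known correct answer."""
    prefix = "\n".join(prefix_steps)
    hits = 0
    for _ in range(K):
        completion = sample_completion(problem, prefix)  # continue the partial solution
        if extract_answer(completion) == gold_answer:
            hits += 1
    return hits / K

def label_trajectory(problem: str, steps: list[str], gold_answer: str,
                     sample_completion, extract_answer, K: int = 8) -> list[float]:
    """One soft label per prefix s_{1:t}, for t = 1..T."""
    return [mc_step_label(problem, steps[:t], gold_answer,
                          sample_completion, extract_answer, K)
            for t in range(1, len(steps) + 1)]
```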

Process supervision plugs into reinforcement learning as a denser reward signal for PPO or GRPO, where each token (or step) advantage is shaped by the PRM rather than by the sparse final reward alone. This is partly why post-2024 reasoning models train so efficiently: the verifier provides feedback throughout the chain rather than only at the answer.
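
One way that shaping can look, assuming potential-based shaping from successive PRM values (a common choice, not the only one used in practice):

```python
def shaped_step_rewards(prm_scores: list[float]) -> list[float]:
    """Turn PRM values v_t = r_PRM(x, s_{1:t}) into per-step rewards
    r_t = v_t - v_{t-1} (with v_0 = 0), so the rewards telescope to the
    final PRM value while crediting each individual step."""
    rewards, prev = [], 0.0
    for v in prm_scores:
        rewards.append(v - prev)
        prev = v
    return rewards

def discounted_returns(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Reward-to-go for each step; subtracting a baseline from these gives a
    simple advantage estimate of the kind used in PPO/GRPO-style updates."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```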

The technique generalises beyond mathematics. Code RL uses unit tests as a natural process signal (each test is a step-level reward); formal proof RL uses Lean kernel acceptance per tactic, which is exactly what AlphaProof does at scale. The Lightman paper is the canonical citation that crystallised the principle for natural-language reasoning.
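
As a toy illustration of unit tests as a dense signal, the sketch below scores a candidate solution by the fraction of assertion-style tests it passes; the candidate and the tests are invented for the example:

```python
def unit_test_reward(candidate_src: str, tests: list[str]) -> float:
    """Fraction of tests passed by a candidate program; each test contributes
    its own reward rather than a single pass/fail outcome for the whole task."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function(s)
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))  # each test is an assert statement
            passed += 1
        except Exception:
            pass
    return passed / len(tests)

candidate = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0", "assert add(0, 0) == 0"]
print(unit_test_reward(candidate, tests))  # -> 1.0
```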

Related terms: Process Reward Model, Outcome Reward Model, Chain-of-Thought, o1 / Reasoning Models, RLHF, Verifiable Rewards, OpenAI o3
