15.9 Process supervision

When a reasoning chain is wrong, where did it go wrong? Outcome reward models (ORMs) score the final answer; process reward models (PRMs) score each step.

Lightman's result

Lightman et al. (2023), "Let's Verify Step by Step", collected the PRM800K dataset of step-level annotations on MATH problems. Annotators marked each line of a model-generated solution as correct, incorrect, or neutral. They trained two reward models on the same base model: an ORM (final-answer correctness only) and a PRM (per-step labels), then used each as a verifier in best-of-$N$ at inference time on a held-out set.

The PRM substantially outperformed the ORM at the same compute budget. Best-of-$N$ with a PRM at $N = 1860$ solved 78.2% of MATH problems, compared with 72.4% for the ORM. The gap widens as $N$ grows: the ORM increasingly picks confidently wrong solutions, while the PRM penalises any solution containing a flawed step, even one whose final answer accidentally comes out right.
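To make the verifier protocol concrete, here is a minimal best-of-$N$ sketch. The `policy.sample` and `prm.step_probs` interfaces are hypothetical stand-ins; the product-of-step-probabilities aggregation is the one Lightman et al. use to turn step scores into a solution score.

```python
from math import prod

def best_of_n(problem, policy, prm, n=64):
    """Best-of-n with a PRM verifier (a sketch; interfaces are hypothetical).

    policy.sample returns a candidate solution as a list of reasoning steps;
    prm.step_probs returns P(step correct | problem, earlier steps) per step.
    """
    candidates = [policy.sample(problem) for _ in range(n)]
    # Lightman et al. score a solution as the product of its per-step
    # correctness probabilities, so a single flawed step sinks the chain.
    return max(candidates, key=lambda steps: prod(prm.step_probs(problem, steps)))
```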

The key conceptual point: outcome-only feedback teaches the model to be right but not principled. A reasoning chain with flawed steps that happens to land on the correct answer is rewarded by an ORM and penalised by a PRM. Process supervision aligns rewards with reasoning quality, not just answer correctness.

PRM training and use

A PRM is typically a Transformer with a per-position scalar head, trained with binary cross-entropy on step labels. At inference, you score each step as it is generated; the score estimates the probability that the step is correct given the preceding steps.
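A minimal PyTorch sketch of such a head, assuming step labels have already been aligned to the token positions that end each step (the alignment and the surrounding training loop are elided):

```python
import torch
import torch.nn as nn

class PRMHead(nn.Module):
    """Per-position scalar head on top of a Transformer's hidden states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, step_ends: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden); step_ends: (batch, n_steps)
        # holds the index of the token that closes each step.
        idx = step_ends.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        step_hidden = hidden_states.gather(1, idx)   # (batch, n_steps, hidden)
        return self.score(step_hidden).squeeze(-1)   # (batch, n_steps) logits

# Training: binary cross-entropy of the step logits against 0/1 step labels.
loss_fn = nn.BCEWithLogitsLoss()
# loss = loss_fn(head(hidden_states, step_ends), step_labels.float())
```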

Modern PRM-based search uses the PRM in three ways (see the sketch after this list):

  • as a filter: prune any branch with a step score below a threshold;
  • as a selection criterion: in tree search, expand the highest-scoring branch first;
  • as a guidance signal: combine with the policy's log-probability to bias generation toward branches the PRM rates highly.
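A compressed sketch showing all three uses inside one step of a greedy search; the `policy.expand` and `prm.step_prob` interfaces, the pruning threshold, and the blending weight `beta` are all illustrative assumptions:

```python
import math

def expand_best_branches(problem, branches, policy, prm,
                         prune_below=0.2, beta=1.0, width=4):
    """One step of PRM-guided tree search over partial reasoning chains.

    branches: list of (steps, cumulative_logprob) partial solutions.
    policy.expand proposes `width` candidate next steps with log-probs;
    prm.step_prob scores a candidate next step. Both are hypothetical.
    """
    scored = []
    for steps, logp in branches:
        for next_step, step_logp in policy.expand(problem, steps, width):
            p = prm.step_prob(problem, steps, next_step)
            if p < prune_below:                     # (1) filter: prune weak steps
                continue
            # (3) guidance: blend policy likelihood with PRM confidence.
            score = logp + step_logp + beta * math.log(p)
            scored.append((steps + [next_step], logp + step_logp, score))
    scored.sort(key=lambda t: t[2], reverse=True)   # (2) selection: best first
    return [(s, lp) for s, lp, _ in scored[:len(branches)]]
```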

Process supervision is widely believed to be a major component of the o1/o3 training recipe, though OpenAI has not published details of either model's training. DeepSeek-R1's verifiable-reward approach can be seen as an extreme form of process supervision in which the verifier checks only the final answer but the RL gradient flows back through every step.

Limits

PRMs require step-level annotations, which are expensive. Annotators disagree on what counts as a "step" (a sentence? a paragraph? a logical inference?). Synthetic PRM training data, generated by stronger models, has become standard but inherits the stronger model's biases.

Math-Shepherd and automated PRM training

Math-Shepherd (Wang et al., 2023) showed that you can train a PRM without any human step-level labels. Given a problem with a known final answer, take a partial reasoning prefix and roll out $K$ continuations from the policy; the empirical pass rate of those completions estimates the value of the prefix. Train the PRM to regress on this Monte Carlo estimate. The resulting PRM is competitive with PRMs trained on PRM800K-style human labels but needs only outcome supervision to train.
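A sketch of the label-generation loop, with `policy.rollout` and `answers_match` as hypothetical stand-ins for the sampling and answer-checking machinery the method requires:

```python
def answers_match(predicted, gold):
    # Hypothetical checker; real systems normalise and parse answers first.
    return predicted.strip() == gold.strip()

def mc_step_value(problem, prefix_steps, gold_answer, policy, k=8):
    """Math-Shepherd-style Monte Carlo value estimate for a reasoning prefix.

    Roll out k completions from the partial solution; the empirical pass
    rate is the soft label for the prefix's last step. (The paper also uses
    a hard variant: label 1 if any rollout reaches the correct answer.)
    policy.rollout is a hypothetical interface that samples a completion
    and returns its final answer.
    """
    hits = sum(
        answers_match(policy.rollout(problem, prefix_steps), gold_answer)
        for _ in range(k)
    )
    return hits / k  # regression / BCE target for the PRM at this step
```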

By 2026 most reasoning systems use a hybrid approach: a small amount of human-labelled step data to anchor the PRM to a useful notion of "step", then large-scale Math-Shepherd-style automatic expansion. The resulting PRM is good enough to serve both as a verifier in tree search and as a dense reward signal during RL fine-tuning, easing the credit-assignment problem that pure outcome rewards face on long chains.
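As an illustration of the dense-reward use, here is one simple way to blend per-step PRM scores with a sparse outcome reward; the blending weight and the additive combination are illustrative choices, not a published recipe:

```python
def shaped_rewards(step_probs, outcome_correct, lam=0.5):
    """Blend dense PRM step scores with a sparse outcome reward (illustrative).

    step_probs: PRM P(correct) for each step; outcome_correct: bool.
    Every step receives a dense signal, and the final step also carries
    the outcome reward, easing credit assignment on long chains.
    """
    rewards = [lam * p for p in step_probs]
    rewards[-1] += 1.0 if outcome_correct else 0.0
    return rewards
```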
