A Process Reward Model (PRM) is a learned scoring function $r_\mathrm{PRM}(x, s_{1:t}) \in [0, 1]$ that takes a problem $x$ and a partial reasoning trajectory $s_{1:t}$ and returns the probability that the prefix is on track to a correct solution. Introduced as a contrast to the Outcome Reward Model (ORM) by Uesato et al. (DeepMind, 2022) and scaled up by Lightman et al. (OpenAI, 2023, Let's Verify Step by Step), the PRM is a cornerstone of post-2024 reasoning systems.
Training data consists of per-step labels of the form (problem, step prefix, label $\in \{\mathrm{correct}, \mathrm{neutral}, \mathrm{wrong}\}$). Two label sources are common.
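For concreteness, a single training record might look like the following minimal sketch (the field names are illustrative, not taken from any specific dataset):

```python
# A hypothetical per-step PRM training record (field names are illustrative).
record = {
    "problem": "Solve 2x + 3 = 11 for x.",
    "steps": [
        "Subtract 3 from both sides: 2x = 8.",
        "Divide both sides by 2: x = 4.",
    ],
    # One label per step, drawn from {correct, neutral, wrong}.
    "labels": ["correct", "correct"],
}
```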
Human labelling. PRM800K (Lightman et al., 2023) contains ~800,000 step-level human annotations on MATH solutions generated by GPT-4. Each step is marked correct, ambiguous, or incorrect, with the chain truncated at the first incorrect step. Human labels are high-fidelity but expensive: PRM800K cost on the order of a million dollars to collect.
AI labelling via Monte Carlo rollouts. Wang et al.'s Math-Shepherd (2023) and Luo et al.'s OmegaPRM (2024) use the base model itself to label steps automatically. For each prefix $s_{1:t}$, they sample $K$ completions and assign
$$r_\mathrm{PRM}(x, s_{1:t}) \approx \frac{1}{K} \sum_{k=1}^K \mathbb{1}[\text{rollout } k \text{ reaches correct answer}].$$
This estimates the value function of the prefix under the rollout policy. The labels are noisier than human ones but scale to millions of trajectories essentially for free, and modern open-source PRMs (Skywork-PRM, Qwen-PRM, Math-Shepherd) are trained primarily on synthetic process labels.
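A minimal sketch of this labelling loop, assuming hypothetical helpers `sample_completion` (continues a prefix under the rollout policy) and `is_correct` (checks the final answer, e.g. by exact match against a reference):

```python
def mc_step_label(problem, prefix, sample_completion, is_correct, K=8):
    """Monte Carlo estimate of the probability that `prefix` reaches a correct answer.

    `sample_completion(problem, prefix)` and `is_correct(problem, completion)`
    are assumed helpers, not part of any particular library.
    """
    hits = 0
    for _ in range(K):
        completion = sample_completion(problem, prefix)  # one rollout from the prefix
        if is_correct(problem, completion):
            hits += 1
    return hits / K  # approximates the prefix's value under the rollout policy
```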
The PRM is typically a transformer initialised from the base policy with a scalar head per step token, trained with binary cross-entropy on the step-level labels. At inference it takes the running solution and emits a score after each newly written step.
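A minimal PyTorch sketch of this architecture, assuming a Hugging Face-style backbone that exposes hidden states and precomputed indices of each step's final token (all names here are illustrative; the three-way labels are typically binarised, e.g. neutral merged into the positive class):

```python
import torch
import torch.nn as nn

class ScalarHeadPRM(nn.Module):
    """Backbone LM plus a scalar head, scored at each step-boundary token."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone            # e.g. the base policy's transformer
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask, step_end_positions):
        # step_end_positions: (batch, max_steps) index of each step's last token
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        h = out.hidden_states[-1]                          # (batch, seq, hidden)
        idx = step_end_positions.unsqueeze(-1).expand(-1, -1, h.size(-1))
        step_h = torch.gather(h, 1, idx)                   # (batch, max_steps, hidden)
        return self.head(step_h).squeeze(-1)               # per-step logits

# Training: binary cross-entropy against {0, 1} step labels, with padded
# step slots masked out of the loss.
loss_fn = nn.BCEWithLogitsLoss(reduction="none")
# logits, labels, step_mask: (batch, max_steps)
# loss = (loss_fn(logits, labels) * step_mask).sum() / step_mask.sum()
```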
PRMs are used in three modes.
- Best-of-N reranking: aggregate per-step scores into a trajectory score (typically the minimum) and pick the best of $N$ samples (see the sketch after this list).
- Search guidance: at each branching point in a tree search, expand the highest-PRM child and prune branches whose minimum score drops below a threshold; this is the AlphaZero-style use that powers AlphaProof and o1-class systems.
- RL reward shaping: feed per-step PRM scores into PPO or GRPO as dense advantages, which sharply reduces credit-assignment variance compared to outcome-only training.
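A minimal sketch of the reranking mode, assuming per-step PRM scores have already been computed for each candidate:

```python
def best_of_n(samples, prm_scores):
    """Pick the sample whose worst step the PRM likes most (min aggregation).

    samples:    list of N candidate solutions
    prm_scores: list of N lists of per-step PRM scores, one list per sample
    """
    trajectory_scores = [min(scores) for scores in prm_scores]  # min over steps
    best = max(range(len(samples)), key=lambda i: trajectory_scores[i])
    return samples[best]
```

Min aggregation is the usual choice because a single wrong step invalidates the whole trajectory, so the weakest step should dominate the score.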
A subtle pitfall is PRM hacking: a policy trained against a fixed PRM will learn to emit step prefixes that score highly without actually being correct (e.g. confident-sounding but vacuous algebra). Mitigations include retraining the PRM on adversarial trajectories, ensembling multiple PRMs, and using the PRM only at inference rather than as an RL reward. The post-DeepSeek-R1 trend has been to lean more on rule-based verifiers (Lean, unit tests) and less on neural PRMs precisely because of this brittleness.
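As a sketch of the ensembling mitigation, one simple aggregation takes a conservative minimum over several independently trained PRMs, so a policy must fool every model simultaneously to inflate its score (the helper name is illustrative):

```python
def ensembled_step_score(prm_models, score_step, problem, prefix):
    """Conservative step score: the minimum over an ensemble of PRMs.

    `score_step(model, problem, prefix)` is an assumed helper returning
    one model's score for the given step prefix.
    """
    return min(score_step(m, problem, prefix) for m in prm_models)
```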
Related terms: Process Supervision, Outcome Reward Model, Chain-of-Thought, o1 / Reasoning Models, OpenAI o3, AlphaProof Internals, PPO, Group Relative Policy Optimization
Discussed in:
- Chapter 16: Ethics & Safety, Process vs Outcome Supervision