Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, & Karl Cobbe (2023). Let's Verify Step by Step.
arXiv:2305.20050.
URL: https://arxiv.org/abs/2305.20050
Abstract. OpenAI's process-reward-model (PRM) paper. Introduces PRM800K, a dataset of 800K step-level human annotations on model-generated solutions to MATH problems; each step of each solution is labelled positive, negative, or neutral. The authors train two reward models, an outcome-supervised reward model that predicts only final-answer correctness and a process-supervised reward model that scores every intermediate step, and compare them as best-of-$N$ verifiers. Process supervision substantially outperforms outcome supervision, with the PRM solving 78% of problems from a representative subset of the MATH test set, and active learning makes its human labels markedly more data-efficient. The paper established PRMs as the dominant verifier architecture for reasoning models and laid groundwork for the o1-style training pipeline.
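A minimal Python sketch of the best-of-$N$ verification protocol described above, assuming a hypothetical `scorer` interface that returns the PRM's per-step correctness probabilities (following the paper, a solution's overall score is the product of its per-step probabilities):

```python
import math
from typing import Callable, List

# Hypothetical interface (not from the paper's code release): a scorer maps a
# candidate solution, given as a list of reasoning steps, to the model's
# probability that each step is correct.
StepScorer = Callable[[List[str]], List[float]]

def prm_solution_score(step_probs: List[float]) -> float:
    # The paper scores a full solution as the probability that *every* step
    # is correct, i.e. the product of per-step probabilities; summing logs
    # (with a small floor) avoids underflow on long solutions.
    return math.exp(sum(math.log(max(p, 1e-12)) for p in step_probs))

def best_of_n(candidates: List[List[str]], scorer: StepScorer) -> List[str]:
    # Best-of-N verification: generate N candidate solutions, score each
    # with the reward model, and keep the highest-ranked candidate.
    return max(candidates, key=lambda sol: prm_solution_score(scorer(sol)))
```

An outcome-reward model slots into the same reranking loop; the difference is only that it produces a single score from the final answer rather than aggregating per-step scores.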
Tags: alignment rlhf reasoning
Cited in: