Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, & Karl Cobbe (2023). Let's Verify Step by Step.
arXiv:2305.20050.
URL: https://arxiv.org/abs/2305.20050
Abstract. OpenAI's process-reward-model (PRM) paper. Introduces PRM800K, a dataset of 800K step-level human annotations on model-generated solutions to MATH problems; each step of each solution is labelled positive, negative, or neutral. The authors train two reward models, an outcome-supervised reward model that predicts only final-answer correctness and a process-supervised reward model that scores every intermediate step, and compare them as best-of-$N$ verifiers. Process supervision substantially outperforms outcome supervision, with the PRM solving 78% of problems from a representative subset of the MATH test set, and active learning makes its human labels markedly more data-efficient. The paper established PRMs as the dominant verifier architecture for reasoning models and laid groundwork for the o1-style training pipeline.
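A minimal Python sketch of the best-of-$N$ verification protocol described above, assuming a hypothetical `scorer` interface that returns the PRM's per-step correctness probabilities (following the paper, a solution's overall score is the product of its per-step probabilities):

```python
import math
from typing import Callable, List

# Hypothetical interface (not from the paper's code release): a scorer maps a
# candidate solution, given as a list of reasoning steps, to the model's
# probability that each step is correct.
StepScorer = Callable[[List[str]], List[float]]

def prm_solution_score(step_probs: List[float]) -> float:
    # The paper scores a full solution as the probability that *every* step
    # is correct, i.e. the product of per-step probabilities; summing logs
    # (with a small floor) avoids underflow on long solutions.
    return math.exp(sum(math.log(max(p, 1e-12)) for p in step_probs))

def best_of_n(candidates: List[List[str]], scorer: StepScorer) -> List[str]:
    # Best-of-N verification: generate N candidate solutions, score each
    # with the reward model, and keep the highest-ranked candidate.
    return max(candidates, key=lambda sol: prm_solution_score(scorer(sol)))
```

An outcome-reward model slots into the same reranking loop; the difference is only that it produces a single score from the final answer rather than aggregating per-step scores.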
Tags: alignment rlhf reasoning
Cited in: