1985–, Computer scientist
John Schulman is an American computer scientist. His PhD work at Berkeley (2015, under Pieter Abbeel) introduced Trust Region Policy Optimization (TRPO), which he later simplified into Proximal Policy Optimization (PPO, 2017). PPO has become the dominant policy-gradient algorithm in deep reinforcement learning and the workhorse of RLHF training for large language models: InstructGPT, ChatGPT, Claude, Gemini, and the post-training stages of nearly every modern LLM use PPO or close variants.
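PPO's central idea fits in one equation, the clipped surrogate objective from the 2017 paper. In a sketch of the notation there, $\hat{A}_t$ is an estimate of the advantage at timestep $t$, $r_t(\theta)$ is the probability ratio between the new and old policies, and $\epsilon$ is a small clipping threshold (the paper reports $\epsilon = 0.2$ working well):

$$
L^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$

Clipping the ratio removes the incentive to move the policy far from its previous version in a single update, achieving much of TRPO's stability with a simple first-order method.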
Schulman co-founded OpenAI in 2015 and led its reinforcement-learning and post-training research, contributing substantially to ChatGPT and GPT-4. In August 2024 he left OpenAI for Anthropic, citing a desire to focus more deeply on AI alignment. He left Anthropic in early 2025 to pursue a startup.
Related people: Sam Altman, Dario Amodei
Works cited in this book:
- Concrete Problems in AI Safety (2016) (with Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, Dan Mané)
- Proximal Policy Optimization Algorithms (2017) (with Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov)
- Training language models to follow instructions with human feedback (2022) (with Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe)
- Scaling Laws for Reward Model Overoptimization (2022) (with Leo Gao, Jacob Hilton)
- Let's Verify Step by Step (2023) (with Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, Ilya Sutskever, Karl Cobbe)
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI