1992–, Computer scientist
Paul Christiano is an American AI alignment researcher whose 2017 OpenAI paper "Deep reinforcement learning from human preferences" (with Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei) introduced the basic technique of reinforcement learning from human feedback (RLHF), which five years later became the basis of InstructGPT and ChatGPT.
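At the core of that paper is a reward model trained on pairwise human preferences between trajectory segments via a Bradley-Terry-style cross-entropy loss; the learned reward then supplies the signal for an ordinary reinforcement learning loop. The following is a minimal sketch of that objective; the `RewardModel` architecture, tensor shapes, and all names are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory segment (here, a sequence of flat feature vectors)
    to a scalar score. Architecture is an illustrative assumption."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # Predict a per-step reward and sum over the time dimension.
        return self.net(segment).sum(dim=1).squeeze(-1)

def preference_loss(model, seg_a, seg_b, prefs):
    """Bradley-Terry cross-entropy: prefs[i] = 1.0 if the human preferred segment A.
    P(A preferred) = exp(r_A) / (exp(r_A) + exp(r_B)) = sigmoid(r_A - r_B)."""
    logits = model(seg_a) - model(seg_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)

# Toy usage: a batch of 8 human comparisons over 25-step segments of 10-dim features.
model = RewardModel(input_dim=10)
seg_a, seg_b = torch.randn(8, 25, 10), torch.randn(8, 25, 10)
prefs = torch.randint(0, 2, (8,)).float()
loss = preference_loss(model, seg_a, seg_b, prefs)
loss.backward()  # gradient step on the reward model; RL then optimizes against it
```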
Christiano left OpenAI in 2021 to found the Alignment Research Center (ARC), which focuses on theoretical and empirical AI alignment research. ARC has produced influential work on dangerous-capability evaluations (the evaluations of GPT-4 and its successors run by METR, a sister organization), on eliciting latent knowledge (a thought-experiment-driven research direction for detecting deceptive AI), and on iterated amplification (a proposed scheme for scalable alignment).
In 2024, Christiano joined the US AI Safety Institute at NIST as head of AI safety. He remains one of the most prominent voices in the AI alignment community, known both for his technical work and for his sober public commentary on the long-term risks of advanced AI.
Related people: Dario Amodei, Geoffrey Irving
Works cited in this book:
- Concrete Problems in AI Safety (2016) (with Dario Amodei, Chris Olah, Jacob Steinhardt, John Schulman, Dan Mané)
- Deep reinforcement learning from human preferences (2017) (with Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei)
- AI safety via debate (2018) (with Geoffrey Irving, Dario Amodei)
- Eliciting Latent Knowledge: How to Tell if Your Eyes Deceive You (2021) (with Ajeya Cotra, Mark Xu)
- Recursively Summarizing Books with Human Feedback (2021) (with Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike)
- Training language models to follow instructions with human feedback (2022) (with Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Jan Leike, Ryan Lowe)
- My views on "doom" (2023)
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (2024) (with Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez)
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety