John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, & Oleg Klimov (2017)

arXiv.

DOI: https://doi.org/10.48550/arxiv.1707.06347

Abstract. Introduces Proximal Policy Optimization (PPO), a policy gradient algorithm with a clipped surrogate objective that balances simplicity, stability, and sample efficiency. PPO is the reinforcement learning algorithm most commonly used in RLHF.
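As a rough illustration of the clipped surrogate objective mentioned in the abstract, the sketch below computes its empirical mean for a small batch of samples. It is a minimal sketch, not the authors' implementation: the function name `clipped_surrogate`, its argument names, and the use of plain Python lists are assumptions made for this example, and the default `eps=0.2` is only one commonly used clipping range.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current and the data-collecting policy (hypothetical argument names).
    advantages: advantage estimates for the same samples.
    eps: clipping range; 0.2 is an assumed default for this sketch.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Two samples: ratio 1.5 with positive advantage, ratio 0.5 with negative advantage.
objective = clipped_surrogate(
    new_logp=[math.log(1.5), math.log(0.5)],
    old_logp=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(objective)
```

In the first sample the ratio is clipped to 1.2, so the policy gains nothing from moving further in the favored direction; in the second the clipped term (-0.8) is lower than the unclipped one (-0.5), so the pessimistic minimum keeps the larger penalty.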

Tags: reinforcement-learning ppo rlhf

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).

Proximal Policy Optimization Algorithms