Proximal Policy Optimization (PPO), introduced by John Schulman et al. (2017), is the dominant policy-gradient algorithm for deep RL. PPO uses a clipped surrogate objective that prevents overly large policy changes at each update step.
Let $\rho_t(\theta) = \pi_\theta(a_t | s_t) / \pi_{\theta_\mathrm{old}}(a_t | s_t)$ be the importance-sampling ratio between the new policy and the old. PPO maximises
$$\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\bigl(\rho_t(\theta) \hat A_t, \, \mathrm{clip}(\rho_t(\theta), 1 - \varepsilon, 1 + \varepsilon) \hat A_t\bigr)\right]$$
where $\hat A_t$ is an advantage estimate (typically GAE) and $\varepsilon \approx 0.1$ to $0.3$ is the clip range. The clip prevents $\rho_t$ from moving too far from 1, limiting how much the policy can change per update.
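A minimal PyTorch sketch of the clipped surrogate term (the helper name and tensor arguments are illustrative, not from the source):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # rho_t = pi_new / pi_old, computed in log space for numerical stability
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # elementwise minimum keeps the pessimistic bound; negate so an optimizer can minimize
    return -torch.min(unclipped, clipped).mean()
```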
Combined objective with value-function loss and entropy bonus:
$$\mathcal{L}^{\mathrm{total}} = \mathcal{L}^{\mathrm{PPO}} - c_1\, \mathbb{E}_t\!\left[\bigl(V_\theta(s_t) - V_t^{\mathrm{target}}\bigr)^2\right] + c_2\, \mathbb{E}_t\!\left[H[\pi_\theta(\cdot \mid s_t)]\right]$$
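Assembled into the composite loss, again as a sketch that reuses the hypothetical `ppo_clip_loss` helper above (terms are negated so a standard optimizer can minimize):

```python
def ppo_total_loss(log_probs_new, log_probs_old, advantages,
                   values, value_targets, entropy,
                   clip_eps=0.2, c1=0.5, c2=0.01):
    policy_loss = ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps)
    value_loss = ((values - value_targets) ** 2).mean()   # squared-error critic loss
    entropy_bonus = entropy.mean()                        # mean policy entropy H
    # minimize: -L^PPO + c1 * value loss - c2 * entropy bonus
    return policy_loss + c1 * value_loss - c2 * entropy_bonus
```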
Training: $K$ epochs of mini-batch SGD over each collected rollout (typically $K = 4$ to $10$, mini-batch size 64 to 256).
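A rough outline of that update schedule, with buffer and function names assumed for illustration:

```python
import numpy as np

def ppo_update(rollout, optimizer, compute_loss, n_epochs=4, minibatch_size=64):
    """Run K epochs of shuffled mini-batch SGD over one collected rollout."""
    n = len(rollout["obs"])
    for _ in range(n_epochs):
        indices = np.random.permutation(n)
        for start in range(0, n, minibatch_size):
            mb = indices[start:start + minibatch_size]
            loss = compute_loss(rollout, mb)  # evaluates L^total on this mini-batch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```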
PPO replaced TRPO (Trust Region Policy Optimization, Schulman et al., 2015), which enforced a hard KL-divergence constraint via expensive natural-gradient (conjugate-gradient) computation. PPO achieves comparable performance with simple first-order SGD and clipping.
Typical hyperparameters (close to Stable Baselines defaults): $\varepsilon = 0.2$, $\gamma = 0.99$, GAE $\lambda = 0.95$, learning rate 3e-4, $c_1 = 0.5$, $c_2 = 0.01$.
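The GAE($\lambda$) advantages mentioned earlier can be computed with the listed $\gamma$ and $\lambda$ roughly as follows (array names and shapes are assumptions):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation; `values` holds one extra bootstrap entry."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv
```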
PPO is the workhorse of RLHF training for large language models: InstructGPT, ChatGPT, early Claude, Gemini, and LLaMA-Chat all used PPO or close variants. GRPO (DeepSeek, 2024) replaces PPO's learned value baseline with group-relative advantages, simplifying the pipeline for reasoning-model training.
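As a rough illustration of what "group-relative advantages" means, the following sketch standardizes each response's reward within its prompt group (variable names are assumptions; DeepSeek's exact recipe differs in details):

```python
import numpy as np

def group_relative_advantages(rewards):
    """rewards: (n_prompts, group_size) array, one scalar reward per sampled response.
    Each response's advantage is its reward standardized within its own group,
    replacing PPO's learned value baseline."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mean) / std
```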
Related terms: Policy Gradient Theorem, RLHF, John Schulman, Reinforcement Learning
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI