Glossary

PPO

Proximal Policy Optimization (PPO), introduced by John Schulman et al. (2017), is the dominant policy-gradient algorithm in deep RL. It optimises a clipped surrogate objective that prevents overly large policy changes in any single update step.

Let $\rho_t(\theta) = \pi_\theta(a_t | s_t) / \pi_{\theta_\mathrm{old}}(a_t | s_t)$ be the importance-sampling ratio between the new policy and the old. PPO maximises

$$\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\bigl(\rho_t(\theta) \hat A_t, \, \mathrm{clip}(\rho_t(\theta), 1 - \varepsilon, 1 + \varepsilon) \hat A_t\bigr)\right]$$

where $\hat A_t$ is an advantage estimate (typically GAE) and $\varepsilon \approx 0.1$ to $0.3$ is the clip range. The clip prevents $\rho_t$ from moving too far from 1, limiting how much the policy can change per update.
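The clipped objective above can be sketched in a few lines of numpy (a minimal illustration; in practice the ratio comes from log-probabilities of a parametric policy and the loss is differentiated by an autodiff framework):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective, negated for minimisation.

    ratio     : pi_new(a|s) / pi_old(a|s) per timestep
    advantage : advantage estimates A_hat (e.g. from GAE)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Element-wise min, then mean over the batch; negate so that
    # gradient *descent* maximises the PPO objective.
    return -np.mean(np.minimum(unclipped, clipped))
```

With a positive advantage, pushing the ratio above $1 + \varepsilon$ gains nothing: the clipped term caps the objective, so the incentive to move further vanishes.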

Combined objective with value-function loss and entropy bonus:

$$\mathcal{L}^{\mathrm{total}} = \mathcal{L}^{\mathrm{PPO}} - c_1 (V_\theta(s_t) - V_t^\mathrm{target})^2 + c_2 H[\pi_\theta(\cdot | s_t)]$$
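Term by term, the combined objective looks like this (a sketch in numpy; `entropy` would come from the policy distribution, and in a real implementation each term is a differentiable tensor):

```python
import numpy as np

def combined_loss(ratio, advantage, value, value_target, entropy,
                  eps=0.2, c1=0.5, c2=0.01):
    """Negated total objective: clipped policy term
    minus c1 * value loss plus c2 * entropy bonus."""
    policy_term = np.minimum(ratio * advantage,
                             np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    value_loss = (value - value_target) ** 2
    total = policy_term - c1 * value_loss + c2 * entropy
    return -np.mean(total)  # minimise the negation
```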

Training: $K$ epochs of mini-batch SGD over each rollout (typically $K = 4$ to $10$, mini-batch size 64–256).
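The $K$-epoch mini-batch schedule can be sketched as follows (`update_step` is a hypothetical function standing in for a gradient step on the combined loss):

```python
import numpy as np

def minibatch_epochs(n_samples, batch_size, k_epochs, rng):
    """Yield shuffled mini-batch index arrays for K epochs over one rollout."""
    for _ in range(k_epochs):
        perm = rng.permutation(n_samples)  # fresh shuffle each epoch
        for start in range(0, n_samples, batch_size):
            yield perm[start:start + batch_size]

# Sketch of the update loop:
# for batch_idx in minibatch_epochs(len(rollout), 64, k_epochs=4, rng=rng):
#     update_step(rollout, batch_idx)   # hypothetical SGD step on L^total
```

Reusing each rollout for several epochs is what makes the clip necessary: after the first gradient step the data is off-policy, and the ratio $\rho_t$ drifts away from 1.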

PPO replaced TRPO (Trust Region Policy Optimization; Schulman et al., 2015), which enforced a hard KL constraint requiring expensive natural-gradient computations. PPO achieves comparable performance with plain SGD and clipping.

Standard hyperparameters (Stable Baselines defaults): $\varepsilon = 0.2$, $\gamma = 0.99$, GAE $\lambda = 0.95$, learning rate 3e-4, $c_1 = 0.5$, $c_2 = 0.01$.
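Expressed as a Stable Baselines 3 call (a usage sketch; parameter names follow the SB3 API, but check them against your installed version):

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy", "CartPole-v1",
    clip_range=0.2,       # epsilon
    gamma=0.99,
    gae_lambda=0.95,
    learning_rate=3e-4,
    vf_coef=0.5,          # c1
    ent_coef=0.01,        # c2
)
model.learn(total_timesteps=100_000)
```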

PPO is the workhorse of RLHF training for large language models: InstructGPT, ChatGPT, early Claude, Gemini, and LLaMA-Chat all used PPO or close variants. GRPO (DeepSeek, 2024) replaces PPO's learned value baseline with group-relative advantages, simplifying the pipeline for reasoning-model training.

Related terms: Policy Gradient Theorem, RLHF, John Schulman, Reinforcement Learning
