Also known as: REINFORCE, policy-based RL
The policy gradient theorem (Sutton, McAllester, Singh, Mansour 2000) gives a tractable expression for the gradient of expected return with respect to policy parameters. For a policy $\pi_\theta(a | s)$ and the expected return objective
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^\infty \gamma^t r_t\right]$$
the gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t\right]$$
where $G_t = \sum_{k \geq 0} \gamma^k r_{t+k}$ is the return from time $t$. This is the foundation of every policy-gradient algorithm.
REINFORCE (Williams 1992) estimates the gradient by Monte Carlo sampling complete trajectories:
$$\hat \nabla J = \frac{1}{N} \sum_{n=1}^N \sum_t \nabla_\theta \log \pi_\theta(a_t^n | s_t^n) \cdot G_t^n$$
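In code, the estimator amounts to weighting each action's log-probability by the return that followed it. Below is a minimal PyTorch sketch, assuming a policy network that maps a batch of states to action logits and a single pre-collected episode; all names are illustrative.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode (list of floats)."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns)

def reinforce_loss(policy, states, actions, rewards, gamma=0.99):
    """Negative REINFORCE objective: minimising it follows the policy gradient."""
    returns = discounted_returns(rewards, gamma)                    # G_t for every step
    log_probs = torch.log_softmax(policy(states), dim=-1)           # (T, num_actions)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a_t | s_t)
    return -(log_pi * returns).mean()
```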
REINFORCE is unbiased but high-variance. Subtracting a baseline $b(s)$ reduces variance without biasing the estimator:
$$\hat \nabla J = \frac{1}{N} \sum_n \sum_t \nabla_\theta \log \pi_\theta(a_t^n | s_t^n) \cdot (G_t^n - b(s_t^n))$$
A learned value function $V_\phi(s)$ approximating $\mathbb{E}[G_t | s_t = s]$ is the standard baseline, giving the advantage
$$A(s_t, a_t) = G_t - V_\phi(s_t)$$
Combining a learned policy (the actor) with a learned value function (the critic) in this way is the actor-critic method.
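A minimal sketch of the resulting actor-critic losses, assuming a separate value network and reusing the illustrative `discounted_returns` helper from the REINFORCE sketch above; fitting $V_\phi$ to the observed returns by regression is one common choice.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(policy, value_fn, states, actions, rewards, gamma=0.99):
    returns = discounted_returns(rewards, gamma)           # G_t, as in the REINFORCE sketch
    values = value_fn(states).squeeze(-1)                  # V_phi(s_t), shape (T,)
    advantages = returns - values.detach()                 # A_t = G_t - V_phi(s_t); baseline not differentiated
    log_probs = torch.log_softmax(policy(states), dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(log_pi * advantages).mean()            # policy-gradient term with baseline
    value_loss = F.mse_loss(values, returns)               # regression target for V_phi
    return policy_loss, value_loss
```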
Generalised advantage estimation (GAE) (Schulman et al. 2016) interpolates between the high-bias, low-variance one-step TD estimate ($\lambda = 0$) and the low-bias, high-variance Monte Carlo return ($\lambda = 1$):
$$\hat A^{\mathrm{GAE}(\gamma, \lambda)}_t = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. GAE is the standard advantage estimator in modern policy-gradient implementations.
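The GAE sum can be computed in a single backward pass using the recursion $\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}$. A minimal sketch, assuming per-step rewards and value estimates (plus a bootstrap value for the final state) as tensors:

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: shape (T,); values: shape (T+1,), where values[T] bootstraps the final state."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages
```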
Trust Region Policy Optimization (TRPO) (Schulman et al. 2015) enforces a KL constraint on policy updates:
$$\max_\theta \mathbb{E}\!\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)} A(s, a)\right] \text{ s.t. } \mathbb{E}[D_\mathrm{KL}(\pi_{\theta_\mathrm{old}} \| \pi_\theta)] \leq \delta$$
TRPO's natural-gradient solution is computationally expensive; PPO (Schulman et al. 2017) replaces the constraint with a clipped surrogate objective:
$$\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}\!\left[\min\!\bigl(\rho_t A_t, \mathrm{clip}(\rho_t, 1 - \varepsilon, 1 + \varepsilon) A_t\bigr)\right]$$
where $\rho_t = \pi_\theta(a_t | s_t) / \pi_{\theta_\mathrm{old}}(a_t | s_t)$ is the importance ratio. PPO's simplicity, robustness and effectiveness made it the dominant policy-gradient method, used in nearly every RLHF training pipeline through 2024.
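A minimal sketch of the clipped surrogate, assuming per-action log-probabilities under the current policy and under the data-collecting policy have already been gathered, along with advantage estimates (e.g. from GAE above):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old.detach())        # rho_t = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages  # clip rho_t to [1-eps, 1+eps]
    return -torch.min(unclipped, clipped).mean()                     # negate to maximise the surrogate
```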
GRPO (Group Relative Policy Optimization) (DeepSeek, 2024) eliminates the value function and computes advantage from a group of sampled responses:
$$A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}$$
where $\{r_i\}$ are rewards for a group of $G$ samples for the same prompt. GRPO simplifies the RLHF pipeline and is the basis of DeepSeek-R1's reasoning training.
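A minimal sketch of the group-relative advantage, assuming `rewards` holds the scalar rewards of the $G$ responses sampled for one prompt; the resulting advantages are then used in a PPO-style clipped objective in place of a critic-based estimate.

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r) over one group of G sampled responses."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)   # eps guards against zero variance within the group
```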
Related terms: Reinforcement Learning, PPO, RLHF
Discussed in:
- Chapter 1: What Is AI?, A Brief History of AI