Glossary

Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO) is the policy-gradient algorithm introduced by DeepSeek in the DeepSeekMath paper (Shao et al., 2024) and used as the workhorse of DeepSeek-R1-Zero and DeepSeek-R1. It is a memory-efficient variant of PPO that eliminates the value (critic) network by estimating advantages from group-relative comparisons within a batch of completions to the same prompt.

Standard PPO requires a value function $V_\phi(s)$ that is roughly the same size as the policy $\pi_\theta$, doubling activation memory and adding a separate forward/backward pass per step. For LLMs at 7B+ parameters this is a serious constraint. GRPO observes that for prompted generation, all completions to a given prompt $x$ share the same baseline expected return, so the advantage of completion $i$ can be estimated by z-scoring the rewards within the group rather than from a learned baseline.

Concretely, for each prompt $x$ in the batch, sample $G$ completions $\{\tau_1, \dots, \tau_G\}$ from the current policy and obtain rewards $\{r_1, \dots, r_G\}$. Define the group-relative advantage

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G) + \epsilon}.$$
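In code, this normalisation is only a few lines. A minimal sketch (PyTorch is assumed, and the helper name is illustrative; `eps` guards the division when the rewards in a group barely vary):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Z-score rewards within each group of G completions to the same prompt.

    rewards: shape (num_prompts, G), one scalar reward per completion.
    Returns advantages of the same shape.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, G = 4 completions scored by a binary verifier.
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]]))
```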

The same advantage is broadcast to every token in $\tau_i$ (or shaped per step if a process reward model is available). The GRPO objective then mirrors PPO's clipped surrogate:

$$\mathcal{L}_\mathrm{GRPO}(\theta) = \mathbb{E}_{x,\,\{\tau_i\}_{i=1}^{G}}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \min\Big( \rho_{i,t}(\theta)\, \hat{A}_i,\ \mathrm{clip}\big(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) - \beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_\mathrm{ref}) \right],$$

where $\rho_{i,t}(\theta) = \pi_\theta(a_{i,t} \mid s_{i,t}) / \pi_{\theta_\mathrm{old}}(a_{i,t} \mid s_{i,t})$ is the per-token importance ratio and $\pi_\mathrm{ref}$ is a frozen reference policy (typically the SFT model) used for KL regularisation. The clipping threshold $\epsilon \approx 0.2$ (not to be confused with the small constant in the advantage denominator) and the KL penalty weight $\beta$ are PPO-standard.
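A rough token-level sketch of this loss, assuming per-token log-probabilities of the sampled tokens have already been gathered from the current, old and reference policies (the function and argument names are hypothetical, and the KL term uses a common low-variance estimator rather than a closed form):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask,
              clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Clipped surrogate with a KL penalty against a frozen reference policy.

    logp_new / logp_old / logp_ref: per-token log-probs of the sampled tokens,
        shape (batch, seq_len), under the current, old and reference policies.
    advantages: group-relative advantage per completion, shape (batch, 1),
        broadcast to every token of that completion.
    mask: 1 for completion tokens, 0 for prompt/padding, shape (batch, seq_len).
    """
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio rho
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # Low-variance estimator of KL(pi_theta || pi_ref):
    # exp(log pi_ref - log pi_theta) - (log pi_ref - log pi_theta) - 1 >= 0.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    per_token = surrogate - beta * kl
    # Maximising the objective = minimising its negative, averaged over completion tokens.
    return -(per_token * mask).sum() / mask.sum()
```

The value of `beta` here is only a placeholder; it is a tuning knob, not a universal constant.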

The savings are substantial. The value-function memory overhead of PPO, roughly equal to the policy itself once activations are counted, is gone. A single 7B-parameter run that needs 8× A100s under PPO can fit on 4× A100s under GRPO. This is what made DeepSeek-R1-Zero feasible to train at open-source compute scale: thousands of RL steps on a 671B-parameter MoE base, with rule-based rewards (math and code verification) supplying the signal.

GRPO's main weakness is variance when $G$ is small. The z-score estimate is noisy for group sizes below about $G = 4$, and degenerate when all $G$ completions earn the same reward (the standard deviation is zero). Practical implementations use $G = 16$ to $G = 64$ and skip degenerate groups. The KL term is also load-bearing: without it, the policy drifts far from the reference and quality collapses on tasks not covered by the verifier.
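One plausible way to express the degenerate-group check (the helper name and threshold are illustrative):

```python
def keep_group(rewards: list[float], min_std: float = 1e-6) -> bool:
    """Return False for groups where every completion earned the same reward:
    the group-relative advantages would all be zero (or undefined), so the
    group contributes no learning signal and is skipped."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return var ** 0.5 > min_std
```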

GRPO has become the de facto open-source RL algorithm for LLM reasoning. The TRL, OpenRLHF and verl libraries all ship GRPO implementations, and most post-2024 reasoning fine-tunes (Qwen-QwQ, Llama-3-Thinker, open replications of R1) use GRPO rather than PPO precisely because the memory savings make it tractable on commodity hardware.
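As a concrete illustration, a hedged sketch of what a run looks like with TRL's `GRPOTrainer`, adapted from its documented quickstart (the model checkpoint, dataset and reward function are placeholders, and argument names may differ between TRL versions):

```python
# Assumes the `trl` and `datasets` packages are installed.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any dataset with a "prompt" column will do; this one appears in TRL's docs.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy rule-based reward: prefer completions close to 200 characters.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]

args = GRPOConfig(output_dir="grpo-demo", num_generations=8)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder causal LM checkpoint
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```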

Related terms: PPO, Direct Preference Optimization, RLHF, Policy Gradient Theorem, DeepSeek R1-Zero, o1 / Reasoning Models
