15.7 GRPO and reasoning-model training

In the closing months of 2024 a new kind of language model started landing on benchmark leaderboards and, more importantly, in the hands of ordinary users. OpenAI shipped o1, then o3. DeepSeek released R1. Anthropic added an "extended thinking" mode to Claude. Google followed with a thinking mode for Gemini 2. What these systems have in common is that they pause before answering. Ask one of them a hard maths problem and, instead of typing the first plausible reply, it produces page after page of internal scratch-work, false starts, second thoughts, half-finished derivations, the occasional "wait, that cannot be right", and only then commits to a final answer.

The trick that makes this work is reinforcement learning with verifiable rewards, and the open-weight algorithm that put it within reach of the wider community is GRPO (Group Relative Policy Optimisation), introduced by DeepSeek. Where §15.6 covered DPO (fine-tuning from human preferences without a reward model), this section turns to a complementary problem: how do you train a model when the task has an objectively correct answer (a maths solution, a passing unit test, a Lean-checked proof) and you would like the model to keep working at it until it gets there?

Symbols Used Here
$\pi_\theta$: the policy (the language model whose weights $\theta$ we are updating)
$r$: a verifiable reward, usually 1 for a correct answer and 0 for a wrong one
$y_1, \ldots, y_K$: a group of $K$ different completions sampled from the policy for the same prompt
$\hat A_i$: the advantage of completion $i$, telling the optimiser how much better than average it was

Why reasoning RL is different from RLHF

Reinforcement learning from human feedback (RLHF), the topic of §15.5, is a fundamentally subjective business. Two annotators read two model replies and pick which one they prefer. The preferences are aggregated into a learned reward model, a neural network that has internalised what humans tend to like, and the policy is then optimised against that learned reward. This works well for chatty, helpful, harmless behaviour, where there is no single right answer and the goal is to capture taste. It works less well when the question is "what is the integral of $x \sin x$?" because taste does not enter into it. The answer is right or it is not.

Reasoning RL flips the source of the reward. Instead of asking a human, or a network trained on human preferences, you ask the world. For arithmetic, you parse the model's final answer and check it. For a coding problem, you compile and run the candidate solution against a hidden test suite. For a formal proof, you hand it to Lean and watch for a tick or a cross. The reward is a single bit, correct or not correct, and it is exact. There is no Bradley–Terry, no preference annotation, no learned reward network. The environment is the reward.
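To make "the environment is the reward" concrete, here is a minimal sketch of an outcome verifier for arithmetic-style prompts, assuming the model has been instructed to end its completion with a line of the form "Answer: …". The function names and that answer-extraction convention are illustrative assumptions, not a standard interface.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the model's stated final answer out of a completion.
    Assumes the convention that the completion ends with a line 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip(), flags=re.MULTILINE)
    return match.group(1).strip() if match else None

def verify_arithmetic(completion: str, reference_answer: str) -> float:
    """Outcome reward: 1.0 if the parsed final answer matches the reference, else 0.0.
    Fluent text that never states a checkable answer also earns 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Compare numerically so that "408", "408.0" and " 408 " all count as correct.
        return float(abs(float(answer) - float(reference_answer)) < 1e-9)
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return float(answer == reference_answer)

print(verify_arithmetic("17 * 24 = 17 * 25 - 17 = 408.\nAnswer: 408", "408"))  # 1.0
print(verify_arithmetic("I think it is probably about 400.", "408"))            # 0.0
```

A coding verifier has the same shape, with the numeric comparison replaced by running the candidate against a hidden test suite.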

This shift has two consequences that turn out to matter enormously in practice. The first is that the signal is untrickable. With a learned reward model, a sufficiently clever policy can find phrasings that look like good answers without being good answers, the classic reward-hacking failure mode. With a verifier, fluent nonsense fails, full stop; only an actually correct solution earns the bit. The second is scale. Human preference data costs roughly £1 per comparison once you account for annotator wages, instructions, quality control and disagreement resolution. Verifier data costs almost nothing: a Python interpreter is happy to grade a billion attempts overnight. That scaling property is what allowed DeepSeek and others to run RL for far more episodes than RLHF ever could, and it is the lever that produced the long-form thinking we now associate with reasoning models.

Two boundaries are worth flagging early. Verifiable rewards only make sense for tasks where a verifier exists. Maths, code, formal logic, board games, certain science problems, certain legal queries, yes. "Write a moving short story about loss", no, you are back in RLHF or DPO territory. And even in verifiable domains the reward is sparse: you find out only at the end whether the answer was right, with nothing said about the hundreds of intermediate steps. Section 15.9 will return to this point under the heading of process supervision.

A useful analogy is the difference between marking a child's arithmetic homework and judging an essay competition. The arithmetic teacher does not need taste, opinions, or training; she has the answer key. She can mark a thousand sums an hour and never be wrong about which are right. The essay judge has none of those advantages. Both kinds of feedback are useful, but they require very different machinery, and a model trained against one is not automatically good under the other. Reasoning RL exploits the arithmetic-teacher setting wherever it can be found.

The GRPO algorithm

GRPO is best understood as a slimmed-down cousin of PPO (proximal policy optimisation), the workhorse of RLHF. PPO needs a value network, a second neural model trained to estimate how good a partial answer is, so that it can compute how much better than expected the final reward turned out. Training a value network alongside the policy roughly doubles memory and adds a second source of approximation error. GRPO does away with it.

The recipe is short:

  1. For each prompt $x$ in the training batch, sample a group of $K$ completions $y_1, \ldots, y_K$ from the current policy. (Typical $K$ is 8 or 16.)
  2. Run each completion through the verifier and record its reward $r_i$. Most rewards are binary, but you can mix in continuous shaping terms, a small bonus for using the right output format, say, or for matching the answer's language to the question's language.
  3. Compute the group-relative advantage for each completion:

$$ \hat A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_K)}{\mathrm{std}(r_1, \ldots, r_K) + \epsilon}. $$

  4. Update the policy weights with a PPO-style clipped objective, using $\hat A_i$ as the advantage and a small KL penalty against the original (pre-RL) reference model so the policy does not drift too far from sensible language.

The intuition behind step 3 is that, within a single group of attempts at the same prompt, the average reward acts as a baseline. Completions that beat the average get a positive advantage and the optimiser nudges the policy towards them. Completions below average get a negative advantage and the policy is nudged away. Dividing by the standard deviation rescales the signal so that hard prompts (low average) and easy prompts (high average) contribute comparably.

A useful corner case: if all $K$ completions in a group earn the same reward, all of them wrong or all of them right, the standard deviation is zero, every advantage collapses to zero (the $\epsilon$ in the denominator exists only to prevent a division by zero), and that prompt contributes no gradient at all. This is GRPO's natural curriculum. Problems the model already aces add no signal; problems it cannot crack at all add no signal; only problems on the edge of competence drive learning. In practice this means the training data filters itself: as the policy improves, easier prompts drop out and harder ones step forward, without any explicit difficulty labels.
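Steps 1–3 amount to a few lines of bookkeeping per prompt. A minimal sketch, assuming the rewards for one group have already come back from the verifier; the function name and the use of NumPy are illustrative choices:

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages for the K completions of a single prompt.

    rewards: shape (K,) array of verifier rewards for one group.
    If every completion earned the same reward, the numerator is zero for
    all of them, so the group contributes no policy-gradient signal.
    """
    baseline = rewards.mean()      # the group mean acts as the baseline
    spread = rewards.std()         # the group standard deviation rescales the signal
    return (rewards - baseline) / (spread + eps)
```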

The headline practical benefit is memory. Removing the value network frees roughly a quarter of the GPU memory that PPO would consume. On a 70-billion-parameter base model, that is the difference between fitting on a single eight-GPU node and needing two.
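The update itself is then the familiar clipped surrogate, with each completion's group-relative advantage broadcast across its tokens and a KL penalty against the frozen reference model; notice that nothing resembling a value head appears. A minimal PyTorch sketch, assuming per-token log-probabilities have already been gathered; the tensor layout, hyperparameter values and KL estimator are illustrative choices rather than DeepSeek's exact implementation:

```python
import torch

def grpo_loss(
    logprobs: torch.Tensor,        # (B, T) per-token log-probs under the current policy
    old_logprobs: torch.Tensor,    # (B, T) log-probs under the policy that sampled the data
    ref_logprobs: torch.Tensor,    # (B, T) log-probs under the frozen pre-RL reference model
    advantages: torch.Tensor,      # (B,)   group-relative advantage per completion
    mask: torch.Tensor,            # (B, T) 1 for completion tokens, 0 for prompt/padding
    clip_eps: float = 0.2,
    kl_coef: float = 0.04,
) -> torch.Tensor:
    """PPO-style clipped surrogate with a KL penalty, but no value network:
    the advantage comes from the group baseline, broadcast over every token."""
    adv = advantages.unsqueeze(-1)                      # (B, 1), broadcast over tokens
    ratio = torch.exp(logprobs - old_logprobs)          # importance ratio per token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_term = -torch.min(unclipped, clipped)        # maximise the clipped surrogate

    # Per-token KL(policy || reference), via a simple low-variance estimator.
    log_ratio_ref = ref_logprobs - logprobs
    kl_term = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    per_token = policy_term + kl_coef * kl_term
    return (per_token * mask).sum() / mask.sum()        # average over completion tokens
```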

To make the algorithm concrete, imagine a single prompt: "What is the smallest positive integer divisible by both 6 and 8?" Sample $K = 8$ completions and grade them. Suppose four come back with the correct answer (24) and four with various wrong answers, giving a reward vector $\mathbf{r} = (1, 0, 1, 1, 0, 0, 1, 0)$. The mean is 0.5, the standard deviation is 0.5, and the resulting advantage vector is $\hat{\mathbf{A}} = (1, -1, 1, 1, -1, -1, 1, -1)$. The optimiser nudges the weights to make the four correct completions a little more probable next time and the four wrong ones a little less. No value network was involved; the eight peers acted as each other's baseline.
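Running the same numbers through the advantage formula confirms the arithmetic (the $\epsilon$ shifts each value by a negligible amount):

```python
import numpy as np

rewards = np.array([1., 0., 1., 1., 0., 0., 1., 0.])
print(rewards.mean(), rewards.std())                              # 0.5 0.5
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(advantages.round(3))                                        # [ 1. -1.  1.  1. -1. -1.  1. -1.]
```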

What "thinking" looks like

The most surprising empirical finding of the DeepSeek-R1 paper was not the algorithm (GRPO is mechanically straightforward) but the behaviour the algorithm induced. Apply GRPO with verifiable rewards to a strong base model, run it for long enough, and the model spontaneously starts producing very long chains of thought before giving its final answer. Not because anyone trained it on long chains of thought, but because longer thinking earned more reward, and the optimiser is patient.

A short excerpt of what one of these traces looks like, paraphrased from a real R1 maths problem:

Let me try setting $x = 2$. That gives $4 - 6 + 5 = 3$, not zero. So $x = 2$ is not a root. Let me try $x = 1$: $1 - 3 + 5 = 3$, still not zero. Hmm. Wait, I wrote the polynomial down wrong. Let me re-read the question. … Yes, the constant term is $-5$, not $+5$. Trying again with $x = 1$: $1 - 3 - 5 = -7$. Trying $x = -1$: $1 + 3 - 5 = -1$. Closer. Let me try the rational-root theorem properly. …

The hallmarks are clear: the model proposes a step, evaluates whether it worked, sometimes catches its own arithmetic mistakes, occasionally backtracks several lines, and the length of the trace scales with the difficulty of the problem. Easy questions get answered in a sentence; competition-level questions can run to thousands of tokens of internal monologue. None of this was hand-engineered. It emerged from the simple fact that, when the verifier is the judge, more careful work pays off.

Test-time compute scaling

Reasoning models gave us a new dial to turn. Until 2024, the conventional way to get a better answer from a language model was to train a bigger one, a slow and expensive process that could happen only every six months or so. Reasoning models added a second axis: spend more compute at inference. Let the model think for longer before it answers.

The most-cited demonstration was OpenAI's o3 result on ARC-AGI, a benchmark of visual abstract-reasoning puzzles on which earlier reasoning models such as o1 had scored in the low thirties of percent. In its low-compute configuration o3 scored roughly 76% on the semi-private evaluation. Allowed to think far longer per puzzle, sampling on the order of a thousand candidate solutions per task and then aggregating, it scored 87.5%, comfortably above the 85% threshold the ARC Prize organisers had set as their grand-prize target. The same model, with no weight changes, gained more than ten percentage points simply by being given more time.
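OpenAI has not published how o3 aggregates its samples, but the simplest scheme, majority voting over final answers (often called self-consistency), already illustrates how extra inference compute is converted into extra accuracy. A sketch, with a hypothetical `generate` callable standing in for one stochastic call to the model:

```python
from collections import Counter

def majority_vote(prompt: str, generate, n_samples: int = 64) -> str:
    """Sample the model n_samples times and return the most common final answer.

    `generate` is a hypothetical callable returning (reasoning_trace, final_answer)
    for one stochastic sample; more samples cost more compute and, on tasks with a
    short checkable answer, tend to buy more reliability.
    """
    votes = Counter()
    for _ in range(n_samples):
        _trace, answer = generate(prompt)
        votes[answer] += 1
    best_answer, _count = votes.most_common(1)[0]
    return best_answer
```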

This is genuinely a new scaling axis. The classical scaling laws (§15.1) tell you what happens when you grow parameters and training tokens together; the test-time compute curve tells you what happens when you keep the model fixed and let it think longer. For users it means a deployment choice: a quick, cheap reply, or a longer, more expensive but more reliable one. For research it raises an open question: how far does this curve go, and which problems does it help most on? Maths and code benefit hugely. Open-ended writing barely shifts. The book returns to this question in §15.8.

Process supervision versus outcome supervision

Verifiable rewards as we have described them are an example of outcome supervision: only the final answer is graded. An alternative is process supervision (Lightman et al., 2023), in which a human or a checker grades each step of the reasoning. Process supervision is more expensive (somebody has to label every line), but it produces models whose intermediate reasoning is more reliable, not just the final answer. Outcome supervision can reward a model that gets the right answer for the wrong reason; process supervision cannot. In practice the frontier labs use a mixture: cheap outcome rewards to drive the bulk of training, more expensive process rewards on a smaller, carefully curated subset.
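One simple way to picture the mixture is as a weighted blend: the cheap outcome bit everywhere, plus an average of step-level scores on the minority of prompts that have them. The sketch below is purely schematic; in practice the step scores come from a separately trained process reward model, and how the frontier labs actually weight the two signals is not public.

```python
def combined_reward(outcome: float,
                    process_scores: list[float] | None = None,
                    process_weight: float = 0.5) -> float:
    """Blend a verifiable outcome reward with optional step-level process scores.

    outcome: 1.0 or 0.0 from the verifier.
    process_scores: per-step scores in [0, 1] from a process reward model,
    or None for the (majority of) prompts that only have an outcome label.
    """
    if not process_scores:
        return outcome
    step_quality = sum(process_scores) / len(process_scores)
    return (1 - process_weight) * outcome + process_weight * step_quality
```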

The trade-off becomes especially visible in domains where intermediate honesty matters. A model that arrives at the right diagnosis through dodgy reasoning is dangerous in clinical practice even if its final answer is correct. Process supervision is one of the few tools we have for catching that failure mode during training rather than after deployment, and it is one of the active research frontiers as of 2026.

Where reasoning RL is used

By the spring of 2026 reasoning RL has become routine across the frontier. OpenAI ships the o-series. DeepSeek's R1 weights and methods are public, and the recipe has been replicated by half a dozen open-weight teams. Anthropic's "extended thinking" mode in Claude is a reasoning variant. Google's Gemini 2 has a thinking mode. Almost every serious lab now offers two flavours of model, a fast one for chat and a slow one for hard problems, and the slow one is invariably trained with some descendant of the techniques in this section.

The economic shape of the technology is also worth noting. Because verifier data is essentially free at the margin, reasoning RL has been one of the most accessible frontier techniques for academic labs and small companies to replicate. You still need a strong base model to start from, but if you have one, GRPO on a few hundred thousand maths and code prompts will get you a credible reasoning variant without the millions of dollars of human labelling that RLHF demands. This is part of why open-weight reasoning models have, briefly and unusually, kept pace with the closed-weight frontier.

What you should take away

  1. Reasoning models are language models trained with reinforcement learning against verifiable rewards, correctness checkers rather than human preference judges.
  2. GRPO is the open algorithm that drives most of this. It strips PPO down by replacing the value network with a simple group-mean baseline, computing the advantage of each completion relative to its peers on the same prompt.
  3. Long internal chains of thought emerged from GRPO training without being directly supervised; they are the model's discovered strategy for earning reward.
  4. Test-time compute is now a scaling axis in its own right: the same weights can be cheap and quick or slow and far more accurate, depending on how long you let the model think.
  5. Verifiable rewards are powerful but bounded: they need a verifier. Outside maths, code and formal logic, you are back in RLHF or DPO territory, possibly augmented with process supervision on whatever steps you can grade.
