Glossary

DeepSeek R1-Zero

DeepSeek R1-Zero is a reasoning model released by DeepSeek on 20 January 2025, alongside its production sibling R1. Its scientific contribution is a clean demonstration that reasoning can emerge from reinforcement learning alone, applied directly to a pre-trained base model with no intervening supervised fine-tuning.

Recipe. Starting from DeepSeek-V3-Base, the team applied Group Relative Policy Optimisation (GRPO) with only two reward signals:

  • Accuracy rewards for mathematics and code, computed by checking final answers against known solutions and by running candidate programs against unit tests.
  • Format rewards that gave credit when the model placed its reasoning between <think> tags and its answer between <answer> tags.

There was no reward model, no human preference data, and no SFT cold-start. Training proceeded in a single RL stage.
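To make the recipe concrete, here is a minimal Python sketch of what such rule-based rewards and GRPO's group-relative advantage could look like. The function names, the tag-matching regex, and the exact answer comparison are illustrative assumptions, not DeepSeek's released implementation:

```python
import re
import statistics

# Illustrative tag pattern; DeepSeek's actual format check is not public.
THINK_ANSWER = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Reward 1.0 if the trace wraps reasoning in <think> tags and the
    final answer in <answer> tags, else 0.0."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Maths variant of the accuracy reward: extract the content of the
    <answer> tags and compare it with the known solution. (For code
    tasks, this step would instead run the candidate against unit tests.)"""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core trick: sample a group of G completions per prompt and
    baseline each reward against the group's own statistics,
    A_i = (r_i - mean(r)) / std(r), so no learned value model is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```

These advantages then weight a clipped, PPO-style objective over each completion's tokens, with a KL penalty toward a reference policy; the group baseline is what lets GRPO drop PPO's separate learned value network.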

Emergent behaviour. As training progressed, the model spontaneously developed several behaviours associated with deliberate reasoning:

  • Reasoning traces grew from a few hundred tokens to several thousand, with the model learning to allocate more test-time compute to harder problems.
  • Self-verification: the model began checking its own intermediate steps and revising them.
  • Reflection and backtracking: the model would pause, recognise an error, and restart a sub-derivation. The DeepSeek paper highlighted an "aha moment" mid-training where reflection emerged sharply.

On AIME 2024, R1-Zero rose from the base model's 15.6% pass@1 to 71.0% over training, and to 86.7% with majority voting over 16 sampled answers (cons@16).
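The majority-voting figure comes from self-consistency style voting: sample many independent traces per problem and keep the most common final answer. A sketch, with hypothetical sample values:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """cons@N: return the most frequent final answer across N sampled
    traces. The vote is correct whenever the right answer is the mode,
    even if most individual traces contain errors."""
    return Counter(a.strip() for a in answers).most_common(1)[0][0]

# Hypothetical traces for one problem: three of five agree on "204".
assert majority_vote(["204", "204", "197", "204", "210"]) == "204"
```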

Limitations. R1-Zero's outputs were technically strong but stylistically rough: it mixed languages within a single trace, omitted explanations, and was hard for users to read. The R1 release added a small amount of curated cold-start chain-of-thought data and a second RL stage with helpfulness and harmlessness rewards, fixing the readability problems while preserving most of the reasoning gains.

Significance. R1-Zero is the most influential open replication of the reasoning-training paradigm previewed by OpenAI's o1 and o3. By keeping the recipe minimal and releasing the model weights together with a detailed technical report, DeepSeek let the rest of the field study and reproduce the method. Within weeks, dozens of teams had reproduced the result on smaller base models, and by mid-2025 RL-on-reasoning had become a standard final stage in frontier post-training pipelines.

The episode also reinforced a broader point: verifiable reward signals (mathematics, code, formal proofs) are sufficient to bootstrap reasoning in capable base models. Extending the approach to less verifiable domains, such as open-ended writing or scientific judgement, became the next frontier.

R1-Zero's weights are freely downloadable, and its sibling R1 has been distilled into smaller open models widely used for self-hosted reasoning.

Related terms: DeepSeek-V3, Reasoning Model Training, Chain-of-Thought, OpenAI o3, Thinking Tokens
