Glossary

STaR (Self-Taught Reasoner)

STaR (Self-Taught Reasoner), introduced by Zelikman, Wu, Mu and Goodman (Stanford / Google Research, 2022), is the technique that first showed an LLM could bootstrap its own reasoning ability with no extra human-written rationales. It is a conceptual precursor of the post-2024 reasoning-model training pipelines, including the rejection-sampling stages of DeepSeek-R1 and the rationale-filtering loops behind Llama 3 instruction tuning.

The setup assumes a dataset of (problem $x$, final-answer $y$) pairs but no human-written reasoning chains. The basic loop is:

  1. Rationale generation. For each problem $x$, prompt the current model to produce a chain of thought $z$ followed by an answer $\hat{y}$:

    $$z, \hat{y} \sim \pi_\theta(\cdot \mid x).$$

  2. Rejection filtering. Keep only the (problem, rationale) pairs where the model's answer matches the ground truth, $\hat{y} = y$. These are the "good" rationales: the model got the right answer, so the chain of thought that led there is plausibly correct.

  3. Rationalisation (the STaR twist). For problems where the model failed, prompt it with the correct answer included as a hint and ask it to generate a rationale that leads to that answer. Keep these post-hoc rationales as well (with the hint stripped out), treating them as supervised training data.

  4. Fine-tune the model on the union of (filtered chains + rationalised chains), producing $\pi_{\theta'}$.

  5. Repeat. Re-generate, re-filter, re-fine-tune. Each iteration extends the set of problems for which the model produces correct chains.
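The five steps above can be sketched in a few lines of Python. The `ToyModel` below is a stand-in for the real policy $\pi_\theta$ (its names and success-probability heuristic are purely illustrative, not part of STaR); the loop structure, however, mirrors the method: generate, filter on answer correctness, rationalise failures, fine-tune on the union, repeat.

```python
import random

# Toy stand-in for an LLM policy pi_theta. In the real method "generate"
# samples a chain of thought from the model and "fine_tune" runs SFT;
# here a stub's success probability grows with the amount of self-generated
# training data, just to make the loop runnable end to end.
class ToyModel:
    def __init__(self):
        self.train_set = []  # accumulated (problem, rationale) pairs

    def generate(self, x, hint=None):
        a, b = x
        if hint is not None:  # rationalisation: correct answer shown as a hint
            return f"{a} + {b} = {hint}", hint
        p = min(0.2 + 0.1 * len(self.train_set), 0.95)
        y_hat = a + b if random.random() < p else a + b + 1
        return f"{a} + {b} = {y_hat}", y_hat

    def fine_tune(self, pairs):
        self.train_set.extend(pairs)

def star_iteration(model, dataset):
    kept = []
    for x, y in dataset:
        z, y_hat = model.generate(x)          # step 1: rationale generation
        if y_hat == y:                        # step 2: rejection filtering
            kept.append((x, z))
        else:                                 # step 3: rationalisation
            z_hint, _ = model.generate(x, hint=y)
            kept.append((x, z_hint))
    model.fine_tune(kept)                     # step 4: fine-tune on the union
    return kept

random.seed(0)
data = [((a, b), a + b) for a in range(3) for b in range(3)]
model = ToyModel()
for _ in range(3):                            # step 5: repeat
    star_iteration(model, data)
```

Note that in this sketch every problem contributes a training pair on every round, because failures are always rationalised; in the original paper, rationalised chains are kept only when the hinted generation actually reaches the given answer.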

The headline empirical claim of STaR was that on arithmetic, commonsense reasoning (CommonsenseQA), and grade-school math word problems (GSM8K), a few iterations of this loop produced large gains using zero new human labels beyond the original answer keys. On the 6B-parameter GPT-J, STaR reached 72.5% on CommonsenseQA, effectively matching a 30×-larger GPT-3 fine-tuned directly on answers (73.0%), and lifted GSM8K accuracy to 10.7% from a few-shot baseline of about 3%.

The "rationalisation" trick is the critical insight. Without it, the model can only improve on problems it can already partially solve, easy problems get reinforced and hard problems are abandoned. With rationalisation, the model is taught how to reason towards answers it would not have found on its own, expanding the reasoning frontier into new territory.

Lineage. STaR is the direct ancestor of Quiet-STaR (Zelikman et al., 2024, internal monologue at every token), V-STaR (verifier-guided STaR), and the rejection-sampling fine-tuning (RFT) stages used in WizardMath, OpenMathInstruct, and DeepSeek-Math. The post-2024 reasoning-model pipeline can be read as STaR + RL + verifiable rewards + scale: the same iterative rationale-filter loop, but with stronger filters (Lean kernel, unit tests) and with GRPO replacing the SFT step.

Limitations. STaR cannot bootstrap from a model that has no chance of solving any problem (the filter returns the empty set on round 1); it requires a base model already in the relevant capability range. It also concentrates on the few problem types the base model can already partially handle, which is one motivation for moving to rule-based verifiers and large synthetic curricula in later systems. Despite these limits, STaR remains the cleanest demonstration that reasoning can be self-taught when correctness can be checked.

Related terms: Chain-of-Thought, Self-Distillation, Synthetic Data for Reasoning, o1 / Reasoning Models, Process Supervision, Verifiable Rewards
