Glossary

Self-Distillation

Self-distillation (also iterative refinement, iterative self-improvement) is the training pattern in which generation $n+1$ of a model is trained on filtered outputs produced by generation $n$. It exploits the fact that a strong model's best samples are often higher quality than its average sample, and that fine-tuning on those best samples concentrates the policy on them, raising the floor.

The basic loop is: (1) freeze the current model $M_n$; (2) sample many candidate completions for a curriculum of prompts; (3) filter the candidates with a quality criterion (a verifier, a reward model, a heuristic, or a stronger judge model); (4) run supervised fine-tuning of $M_n$ on the filtered set to obtain $M_{n+1}$; (5) repeat. This is sometimes called rejection sampling fine-tuning (RFT) when the filter is a verifier, self-training when the filter is a teacher model, or expert iteration in the RL literature.
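A minimal sketch of the loop, with the LLM and the SFT step replaced by toy stand-ins so the example runs on its own. The names (`sample`, `verifier`, `self_distill_round`), the arithmetic task, and the error-shrinking rule are all illustrative assumptions, not part of any published pipeline:

```python
import random

# Toy stand-ins: a real pipeline would call an LLM for sampling and an
# SFT trainer for step (4). Here the "model" answers arithmetic prompts
# with some error rate, and "fine-tuning" shrinks that error rate in
# proportion to how many samples the verifier accepted.

def sample(model_err: float, prompt: tuple[int, int]) -> int:
    a, b = prompt
    return a + b if random.random() > model_err else a + b + random.choice([-1, 1])

def verifier(prompt: tuple[int, int], answer: int) -> bool:
    return answer == sum(prompt)              # the indicator reward f(x, y)

def self_distill_round(model_err: float, prompts, k: int = 8):
    accepted = []
    for p in prompts:
        for _ in range(k):                    # (2) sample k candidates per prompt
            y = sample(model_err, p)
            if verifier(p, y):                # (3) keep only accepted samples
                accepted.append((p, y))
    # (4) "fine-tune": in this toy, error shrinks with the accept rate
    accept_rate = len(accepted) / (len(prompts) * k)
    return model_err * (1 - 0.5 * accept_rate), accepted

err = 0.5
prompts = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(100)]
for n in range(5):                            # (5) repeat
    err, data = self_distill_round(err, prompts)
    print(f"round {n}: error {err:.3f}, kept {len(data)} examples")
```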

Mathematically, if $\pi_n$ is the current policy and $f$ is the filter, the next policy minimises

$$\mathcal{L}_{n+1}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_n(\cdot | x), f(x,y) = 1} \log \pi_\theta(y | x).$$

Under reasonable conditions this is equivalent to projecting $\pi_n$ onto the subset of trajectories the filter accepts, a one-step policy improvement under the indicator reward $f$.
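Concretely, if the cross-entropy above is minimised exactly and without capacity constraints, the optimum is just $\pi_n$ restricted to the accepted set and renormalised:

$$\pi_{n+1}(y \mid x) = \frac{\pi_n(y \mid x)\,\mathbf{1}\{f(x,y)=1\}}{\sum_{y'} \pi_n(y' \mid x)\,\mathbf{1}\{f(x,y')=1\}},$$

i.e. probability mass is moved from rejected trajectories onto accepted ones in proportion to their original probability.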

Several flagship 2024–2025 systems use self-distillation centrally.

Llama 3 instruction tuning. Meta reports that the post-training pipeline iterated six rounds of sampling on instruction prompts, scoring with reward models, retaining the top responses, running SFT on the curated set, and then DPO. Each round measurably lifted Arena Elo.

DeepSeek-R1. The R1 pipeline starts with R1-Zero (pure RL from the base model), then curates and rewrites its long reasoning chains into cold-start SFT data for R1, which is further trained with RL to be more user-friendly. The cold-start step is exactly self-distillation from a same-family teacher.

AlphaProof's data flywheel. The verifier is Lean acceptance: proof attempts the Lean kernel rejects are discarded, accepted proofs are added to the training corpus, the next iteration's model is retrained on the enlarged corpus, and the new model proposes proofs the previous one could not. After many rounds the corpus contains millions of proofs, most of which could not have been found at iteration 1.

STaR (Self-Taught Reasoner) is the canonical earlier instance: bootstrap rationales by generating chains of thought, filter for those that reach the correct answer, and fine-tune on the filtered set.
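In code, the STaR filter reduces to a string comparison against the gold label. A minimal sketch, where the trailing `Answer:` line is a convention chosen here for illustration, not the paper's exact format:

```python
import re

def star_filter(chain_of_thought: str, gold_answer: str) -> bool:
    """Keep a rationale only if its final answer matches the gold label."""
    # Assumes rationales end with a line like "Answer: 42" (an
    # illustrative convention, not prescribed by the STaR paper).
    m = re.search(r"Answer:\s*(.+?)\s*$", chain_of_thought)
    return m is not None and m.group(1).strip() == gold_answer.strip()

assert star_filter("3 groups of 4 is 12.\nAnswer: 12", "12")
assert not star_filter("3 + 4 = 7.\nAnswer: 7", "12")
```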

The failure modes are well-studied. Mode collapse: each round narrows the distribution; over many rounds the model becomes confidently wrong on tail topics. Hallucinated confidence: filters that select for plausibility (rather than correctness) reinforce the model's biases. Distillation drift: when the teacher and student are the same family, idiosyncratic teacher errors become systematic student errors. Mitigations are diversity injection (mixing in human data, sampling at high temperature), independent verifier filters (verifiable rewards beat learned filters), and bounding the number of rounds.
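As one concrete mitigation, diversity injection can be as simple as pinning a fixed fraction of human-written data into every round's fine-tuning set. A minimal sketch, where the function name, the 30% default, and the sampling scheme are illustrative assumptions rather than a published recipe:

```python
import random

def build_round_dataset(model_accepted, human_pool, human_frac=0.3, seed=0):
    """Mix a fixed fraction of human-written examples into each round's
    SFT set so repeated self-distillation cannot fully collapse onto the
    model's own modes. The 30% ratio is an illustrative choice."""
    rng = random.Random(seed)
    # Choose n_human so that human data is human_frac of the final mix.
    n_human = int(len(model_accepted) * human_frac / (1 - human_frac))
    mixed = list(model_accepted) + rng.sample(human_pool, min(n_human, len(human_pool)))
    rng.shuffle(mixed)
    return mixed
```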

Self-distillation is also the mechanism behind synthetic-data-only training: when the filter is strong enough (e.g. a Lean kernel), the model can in principle improve indefinitely from its own samples without human labels. Whether this generalises beyond formally verifiable domains is one of the central questions of post-2024 frontier research.

Related terms: Synthetic Data for Reasoning, STaR (Self-Taught Reasoner), RLHF, AlphaProof Internals, DeepSeek R1-Zero, o1 / Reasoning Models, Self-Play on Verifiable Rewards
