Synthetic data for reasoning refers to the practice of generating artificial training examples (problem statements, solutions, reasoning chains) and using them to train or fine-tune LLMs on reasoning tasks. It is the data-side counterpart to self-play with verifiable rewards, and it is what enables modern reasoning models to be trained on essentially unlimited mathematics and code corpora despite the limited supply of human-written examples.
Synthetic data pipelines exploit a simple asymmetry: in domains with verifiable rewards (math, code, formal proofs), generating candidate solutions is cheap and verifying them is exact. A strong model produces many candidate solutions to a problem, a verifier filters them, and the surviving (problem, correct-solution) pairs become high-quality training data, with no human labelling required.
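A minimal sketch of that asymmetry, for math with a known reference answer. The `ANSWER:` convention and the function names here are illustrative assumptions, not any particular pipeline's API:

```python
def verify_math(solution: str, reference_answer: str) -> bool:
    """Exact, cheap verification: compare the solution's final answer to the
    reference. Assumes (by convention) solutions end with 'ANSWER: <value>'."""
    final = [ln for ln in solution.strip().splitlines() if ln.startswith("ANSWER:")]
    return bool(final) and final[-1].removeprefix("ANSWER:").strip() == reference_answer

def filter_pairs(problem: str, reference_answer: str, candidates: list[str]):
    """Generation is cheap and fallible; verification is exact.
    Only verified (problem, solution) pairs survive as training data."""
    return [(problem, s) for s in candidates if verify_math(s, reference_answer)]
```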
Several archetypes recur. Solution generation from existing problems: take an existing problem set (MATH, GSM8K, AoPS, Codeforces), generate 50–500 chain-of-thought solutions per problem with a strong base model, filter by final-answer correctness or test-passing, and keep the survivors. This is the recipe behind WizardMath, MAmmoTH, and OpenMathInstruct.
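A hedged sketch of this rejection-sampling recipe. `sample_solution` stands in for a strong model's sampling call and `verify` for a check like the one above; the sample count is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    reference_answer: str

def build_sft_corpus(problems: list[Problem], sample_solution, verify, k: int = 64):
    """Rejection sampling over an existing problem set: draw k chain-of-thought
    solutions per problem at temperature > 0, keep only those that verify."""
    corpus = []
    for prob in problems:
        for _ in range(k):
            sol = sample_solution(prob.statement)        # cheap to generate
            if verify(sol, prob.reference_answer):       # exact to check
                corpus.append((prob.statement, sol))     # survivor becomes data
    return corpus
```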
Problem generation is harder but more scalable: prompt a strong model to invent new problems within a topic ("generate a hard combinatorics problem suitable for IMO shortlist"), have it solve them, and keep the (problem, solution) pairs that are non-trivially solvable. AlphaGeometry's training set was built this way: a symbolic generator produced millions of synthetic geometry configurations, the symbolic solver attempted them, and the solvable-but-non-trivial ones became training data.
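A sketch of the problem-generation loop, under the assumption that "non-trivially solvable" is estimated from the pass rate over repeated solve attempts; `invent`, `solve`, and `verify` are placeholders:

```python
def generate_problem_corpus(invent, solve, verify, topic: str,
                            n: int = 1000, attempts: int = 8):
    """Invent new problems, then keep only the solvable-but-non-trivial ones:
    at least one attempt verifies, but not all of them do."""
    kept = []
    for _ in range(n):
        problem = invent(f"Generate a hard {topic} problem. State it precisely.")
        solutions = [solve(problem) for _ in range(attempts)]
        verified = [s for s in solutions if verify(problem, s)]
        if 0 < len(verified) < attempts:   # solvable, but not trivially so
            kept.append((problem, verified[0]))
    return kept
```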
Bootstrapped rationale generation (STaR): take a problem set with answers but no rationales, generate rationales with the current model, keep those that lead to the correct answer, fine-tune, and iterate. This is how chain-of-thought ability is amplified across model generations.
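A simplified version of the STaR loop (omitting the paper's rationalization step, where the model is shown the answer as a hint); `sample` and `finetune` are stand-ins, not a real API:

```python
def star_loop(base_model, dataset, sample, finetune, rounds: int = 3, k: int = 8):
    """Simplified STaR: generate rationales with the current model, keep those
    whose final answer matches the gold label, then fine-tune the base model
    on the accumulated survivors and repeat."""
    model, kept = base_model, []
    for _ in range(rounds):
        for question, gold in dataset:
            for _ in range(k):
                rationale, answer = sample(model, question)
                if answer == gold:                 # answer-match filter
                    kept.append((question, rationale))
                    break                          # one surviving rationale suffices
        model = finetune(base_model, kept)         # restart from base each round, as in STaR
    return model
```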
The AlphaProof flywheel is the most ambitious instance. Starting from Mathlib (Lean's mathematical library) and a corpus of human-translated competition problems, AlphaProof's neural prover proposed Lean tactics, the Lean kernel verified them, and accepted proofs were added to the corpus. The next iteration's model, trained on this expanded corpus, could attempt harder problems. Over many iterations the corpus grew to roughly 100 million proof-relevant examples, none of which existed in the human-written literature.
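Structurally this is expert iteration with the Lean kernel as the filter. A hedged sketch of the loop, with `propose`, `kernel_check`, and `train` as stand-ins for components AlphaProof does not publicly expose:

```python
def proof_flywheel(prover, theorems, propose, kernel_check, train, rounds: int = 10):
    """Expert iteration: propose tactic proofs, let the kernel verify them,
    add accepted proofs to the corpus, retrain, attempt harder problems."""
    corpus = []
    for _ in range(rounds):
        for thm in theorems:
            proof = propose(prover, thm)      # neural prover suggests tactics
            if kernel_check(thm, proof):      # the Lean kernel is the sole judge
                corpus.append((thm, proof))
        prover = train(prover, corpus)        # next iteration sees a larger corpus
    return prover, corpus
```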
Quality filtering is the central engineering challenge. Loose filters preserve scale but inject errors; tight filters preserve accuracy but shrink the corpus. Common practice is multi-stage: (1) verifier filter (must reach the correct answer / pass tests / type-check), (2) diversity filter (deduplicate near-identical solutions), (3) difficulty filter (remove trivial cases that do not stress the model), (4) length and formatting filter. The Phi series of models from Microsoft demonstrated that aggressive synthetic-data curation can let a small model match much larger models on reasoning at a fraction of the parameter count.
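The four stages compose naturally. A sketch, where the pass-rate thresholds and length cap are illustrative assumptions and `pass_rate` is assumed to be estimated separately from sampling statistics:

```python
import hashlib

def multi_stage_filter(pairs, verify, pass_rate, lo=0.02, hi=0.9, max_len=4096):
    """pairs: iterable of (problem, solution). pass_rate: problem -> empirical
    solve rate under sampling. Thresholds here are illustrative, not published."""
    # Stage 1: verifier filter: must reach the correct answer / pass tests.
    out = [(p, s) for p, s in pairs if verify(p, s)]

    # Stage 2: diversity filter: drop near-duplicates via a normalized hash.
    seen, deduped = set(), []
    for p, s in out:
        key = hashlib.sha1(" ".join(s.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            deduped.append((p, s))

    # Stage 3: difficulty filter: drop problems that are nearly always solved
    # (too easy) or essentially never solved (likely mislabeled or unlearnable).
    hard_enough = [(p, s) for p, s in deduped if lo <= pass_rate(p) <= hi]

    # Stage 4: length and formatting filter.
    return [(p, s) for p, s in hard_enough if 0 < len(s) <= max_len]
```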
Risks. Without verifier filtering, synthetic data can collapse a model: repeated training on its own unfiltered outputs degrades capability ("model autophagy"). Unverifiable synthetic data (e.g. for general dialogue) is much trickier than verifiable synthetic data. And synthetic data's gains do not necessarily transfer outside its generated domain: a model fine-tuned on synthetic competition math may not improve on natural-language reasoning unless the chains of thought transfer.
Despite the risks, synthetic data is now the dominant training signal for the reasoning-capable layer of frontier models. The combination of a strong base model + a deterministic verifier + a curriculum of problems is the most productive data engine the field has yet found.
Related terms: STaR (Self-Taught Reasoner), Self-Distillation, AlphaProof Internals, AlphaGeometry 2, Verifiable Rewards, o1 / Reasoning Models, DeepSeek R1-Zero
Discussed in:
- Chapter 16: Ethics & Safety, Synthetic Data and the Reasoning Flywheel