AlphaProof is DeepMind's formal-proof reasoning system, announced in July 2024 alongside AlphaGeometry 2, when the combined system achieved a silver-medal score (28/42) on the 2024 International Mathematical Olympiad. It is the highest-compute concrete instance of self-play on verifiable rewards yet built, and the canonical reference for what post-2024 mathematical reasoning models can in principle achieve given a perfect verifier.
The architecture has four major components.
The verifier: Lean 4. Every candidate proof step is submitted to Lean's kernel, which accepts or rejects it deterministically according to the rules of dependent type theory. The kernel is the source of verifiable rewards: there is no learned reward model, only typechecking. AlphaProof works in Lean 4 with Mathlib (the community formal-mathematics library, roughly 1.5 million lines) as the supporting context.
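As a concrete illustration of what the kernel actually checks (a toy example, not AlphaProof output), a candidate proof is just Lean 4 tactic code built against Mathlib, and it either typechecks or it does not:

```lean
import Mathlib

-- A toy goal in the style of a formalised competition sub-step.
-- The kernel either accepts the proof term these tactics elaborate to or
-- rejects it; that binary accept/reject is the entire reward signal.
example (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  positivity
```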
The policy: a Gemini-derived LLM. A large transformer fine-tuned to propose Lean tactics conditional on the current proof state. It outputs both natural-language commentary and Lean syntax; only the Lean syntax is checked.
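A minimal sketch of that interface, assuming hypothetical `sample_tactic` and `try_tactic` callables (this is not DeepMind's API): the commentary is discarded, and only tactics the Lean elaborator accepts become candidate proof steps.

```python
# Hedged sketch, not DeepMind's code: sample candidate tactics from the policy
# for one proof state and keep only those Lean accepts.
from typing import Callable, Optional

def expand_state(
    proof_state: str,
    sample_tactic: Callable[[str], tuple[str, str]],  # -> (commentary, lean_tactic)
    try_tactic: Callable[[str, str], Optional[str]],  # -> new proof state, or None if rejected
    k: int = 16,
) -> dict[str, str]:
    """Sample k tactic candidates; return {tactic: resulting proof state}."""
    accepted: dict[str, str] = {}
    for _ in range(k):
        commentary, tactic = sample_tactic(proof_state)
        # The natural-language commentary is never scored; Lean alone judges the tactic.
        new_state = try_tactic(proof_state, tactic)
        if new_state is not None:
            accepted[tactic] = new_state
    return accepted
```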
The data engine: an autoformaliser plus ~100 million synthetic problems. The headline number is the dataset: roughly 100 million formal problem statements, autoformalised into Lean by a separate model from a much smaller pool of informal sources (competition mathematics, textbooks, and synthetic generation), which the prover then attempted. Verified successes were added to the training corpus, growing the data flywheel iteratively. This is the same pattern as STaR, but at AlphaZero compute scale.
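One round of that flywheel, sketched under assumed interfaces (the function names are illustrative, not DeepMind's):

```python
# Hedged sketch of a single data-flywheel round: autoformalise informal
# statements, let the current prover attempt them, and keep only
# kernel-verified successes as new training data.
def flywheel_round(informal_statements, autoformalise, attempt_proof, kernel_accepts, corpus):
    for statement in informal_statements:
        formal = autoformalise(statement)        # separate autoformaliser model
        proof = attempt_proof(formal)            # current policy + search
        if proof is not None and kernel_accepts(formal, proof):
            corpus.append((formal, proof))       # only verified proofs enter training
    return corpus
```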
The search: AlphaZero-style tree search with a learned value function. At inference time, AlphaProof does not greedily emit a single proof; it expands a tree of partial proof states, scored by a learned value function and explored via a UCT-like procedure inherited from AlphaZero and MuZero. Each node is a partial proof state, each edge is a candidate tactic, and at the IMO the search budget reportedly ran to as much as three days per problem on TPUs.
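A stripped-down sketch of that search skeleton, assuming hypothetical `propose_tactics`, `apply_tactic`, `value_fn`, and `is_closed` interfaces (the production system is far larger, but the selection/expansion/backup structure is standard UCT):

```python
import math

class Node:
    """One node in the proof-search tree: a partial Lean proof state."""
    def __init__(self, state):
        self.state = state        # pretty-printed proof state
        self.children = {}        # tactic string -> child Node
        self.visits = 0
        self.value_sum = 0.0

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_uct=1.4):
    # UCT rule: trade off a child's mean value against an exploration bonus.
    return max(
        node.children.values(),
        key=lambda ch: ch.q() + c_uct * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
    )

def prove(root_state, propose_tactics, apply_tactic, value_fn, is_closed, n_sims=10_000):
    root = Node(root_state)
    for _ in range(n_sims):
        node, path = root, [root]
        # 1. Selection: walk down the tree along UCT-maximising tactic edges.
        while node.children:
            node = select_child(node)
            path.append(node)
        # 2. Expansion and evaluation at the leaf.
        if is_closed(node.state):
            value = 1.0                            # all goals closed: verified proof
        else:
            for tactic in propose_tactics(node.state):
                new_state = apply_tactic(node.state, tactic)
                if new_state is not None:          # drop tactics Lean rejects
                    node.children[tactic] = Node(new_state)
            value = value_fn(node.state)           # learned provability estimate
        # 3. Backup the value along the selection path.
        for n in path:
            n.visits += 1
            n.value_sum += value
    return root  # read off the best attempt by following most-visited children
```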
The training loop combines expert iteration (collect successful proof attempts, fine-tune the policy on them) with RL via REINFORCE-style updates weighted by tree-search outcomes. The learned value function is updated to predict whether a given proof state is solvable within the search budget. Over many iterations, the policy concentrates on tactics that lead to verified proofs and the value function becomes a reliable difficulty estimate.
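A sketch of one outer iteration under assumed interfaces (illustrative names, not DeepMind's implementation); the only reward is the kernel's accept/reject of a complete proof:

```python
# Hedged sketch: one expert-iteration round combining supervised fine-tuning on
# verified proofs, a REINFORCE-style reward-weighted update, and value-function
# fitting toward "solvable within the search budget".
def training_iteration(problems, run_search, finetune_on_proofs, reinforce_update, fit_value):
    sft_data = []     # (proof_state, tactic) pairs taken from verified proofs
    rl_data = []      # (proof_state, tactic, reward) triples from searched edges
    value_data = []   # (proof_state, target) pairs for the value function

    for problem in problems:
        tree = run_search(problem)                 # AlphaZero-style proof search
        solved = tree.proof_path is not None       # did the Lean kernel verify a proof?
        on_proof = set(tree.proof_path) if solved else set()
        for state, tactic in tree.visited_edges:
            # Binary reward: 1 only for edges on a kernel-verified proof.
            rl_data.append((state, tactic, 1.0 if (state, tactic) in on_proof else 0.0))
        if solved:
            sft_data.extend(tree.proof_path)                            # expert iteration
            value_data.extend((s, 1.0) for s, _ in tree.proof_path)     # known solvable
        else:
            value_data.extend((s, 0.0) for s, _ in tree.visited_edges)  # unsolved in budget

    finetune_on_proofs(sft_data)   # supervised fine-tuning on successful trajectories
    reinforce_update(rl_data)      # reward-weighted log-prob (REINFORCE-style) update
    fit_value(value_data)          # regress V(state) toward the solved/unsolved label
```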
The IMO 2024 result is the empirical headline: AlphaProof solved problems P1, P2, and P6 of IMO 2024 (algebra and number theory), with AlphaGeometry 2 solving the geometry problem P4. The two combinatorics problems, P3 and P5, went unsolved. Combined score: 28/42, one point below the gold-medal threshold. The harder problems took up to three days of TPU compute each, far longer than a human contestant gets, an explicit demonstration of test-time compute scaling laws in their most extreme form: thousands of TPU-hours per problem in exchange for human-grandmaster-level mathematical reasoning.
Implications. AlphaProof is a proof of concept that mathematical reasoning at the elite human level can be achieved by a sufficiently large reasoning model trained on synthetic data filtered by a deterministic verifier. The remaining open questions are (i) compute efficiency: three days per IMO problem is impractical at scale, and (ii) transfer: does the formalised-mathematics ability carry over to natural-language mathematics or to other reasoning domains? Early evidence is mixed: distillations of AlphaProof-trained models do show improved general mathematical reasoning, but the gap to natural-language fluency remains large.
AlphaProof's design (Lean verifier, large-scale synthetic data, neural search guided by a learned value function) is now the blueprint for the next generation of formal-mathematics systems and a clear influence on OpenAI o3 and other frontier reasoners.
Related terms: AlphaGeometry 2, Self-Play on Verifiable Rewards, Verifiable Rewards, Synthetic Data for Reasoning, o1 / Reasoning Models, MuZero, OpenAI o3, Test-Time Compute Scaling Laws
Discussed in:
- Chapter 16: Ethics & Safety, AlphaProof and Formal Mathematics