Glossary

FrontierMath

FrontierMath, introduced by Epoch AI in November 2024, is a research-level mathematics benchmark designed to remain unsaturated for years. The problems are commissioned from dozens of professional mathematicians, including IMO problem-setters, and Fields Medallists Terence Tao and Timothy Gowers have publicly vouched for their difficulty. They span modern number theory, algebraic geometry, combinatorics, analysis, category theory, dynamical systems, and mathematical physics, at a level comparable to late PhD coursework or early-stage research.

Problems have integer or computable-expression answers so that grading is automatic, but the path to the answer requires substantial mathematical insight that cannot simply be retrieved from the literature. Each problem is reviewed by at least one independent expert and must remain non-trivial even with full Mathematica/Sage assistance. Apart from a small set of published sample problems, the dataset is kept secret: neither problems nor solutions are released, and Epoch AI runs evaluations on its own infrastructure against the held-out problems to prevent training-data contamination.
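
To make the automatic-grading idea concrete, here is a minimal sketch of how exact-answer checking of this kind can work. It is illustrative only, not Epoch AI's actual harness: the function name grade_answer and the use of SymPy for symbolic comparison are assumptions of this sketch.

    # Illustrative autograder sketch (Python); not Epoch AI's actual grading code.
    import sympy

    def grade_answer(submitted: str, reference: str) -> bool:
        """Return True iff the submitted expression is symbolically equal to the reference."""
        try:
            diff = sympy.simplify(sympy.sympify(submitted) - sympy.sympify(reference))
            return diff == 0
        except (sympy.SympifyError, TypeError):
            return False  # unparseable submissions are marked incorrect

    # Equivalent closed forms grade as correct with no human in the loop:
    print(grade_answer("2**10 - 24", "1000"))      # True
    print(grade_answer("binomial(10, 3)", "120"))  # True
    print(grade_answer("factorial(5)", "121"))     # False

The benchmark reportedly pairs this style of exact verification with answers, often large integers, chosen to be effectively impossible to guess, which is what makes fully automatic grading viable at research level.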

The benchmark is segmented into three tiers by estimated difficulty: Tier 1 (advanced undergraduate / early graduate), Tier 2 (graduate research), and Tier 3 (genuinely hard research mathematics). Around 25% of the dataset is Tier 3.

Performance trajectory. At launch in November 2024, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro each scored 2%, and OpenAI o1-preview scored 1%. OpenAI o3, announced in December 2024, scored 25.2% with high-compute test-time inference, a roughly 12× jump that drew intense attention from the mathematical community. By mid-2025, Gemini 2.5 Pro Deep Think had reached 30%, Grok 4 35%, and OpenAI o3-pro with extended reasoning had crossed 45% on Tiers 1–2. Tier 3 remains overwhelmingly unsolved by all frontier systems (best results around 15%).

Known issues. Because the problems are secret, third parties cannot independently verify Epoch AI's grading. There has been controversy over OpenAI's funding relationship with Epoch AI: OpenAI funded the benchmark's development and had access to the problems and solutions (other than a hold-out set) for training-data exclusion, which raises concerns about independence even if the reported scores are accurate. The benchmark is also small (~300 problems), so a single problem moves the headline score by roughly 0.3 percentage points.
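
As a rough illustration of that sensitivity (a back-of-the-envelope calculation, not a figure from Epoch AI), one problem out of roughly 300 moves the headline score by about a third of a percentage point, and ordinary sampling noise at a 25% score is several times larger:

    # Back-of-the-envelope: per-problem swing and binomial noise on a ~300-problem benchmark.
    from math import sqrt

    n = 300    # approximate number of problems
    p = 0.25   # illustrative headline score

    per_problem_swing = 100 / n              # percentage points per problem
    std_error = 100 * sqrt(p * (1 - p) / n)  # binomial standard error, in points

    print(f"one problem ~ {per_problem_swing:.2f} points")   # ~0.33
    print(f"sampling noise ~ +/-{std_error:.1f} points")     # ~2.5

On this reading, score differences of a point or two between models sit within sampling noise and should not be over-interpreted.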

Modern relevance. FrontierMath is the most closely watched mathematical-reasoning benchmark of 2025, having displaced AIME and MATH as the headline test at the frontier. It is widely cited in AI-2027 discussions and AGI-timeline analyses as a leading indicator of automated mathematical research capability.

Reference: Glazer et al. (Epoch AI), "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI", arXiv 2024; epochai.org/frontiermath.

Related terms: AIME, MATH, Humanity's Last Exam, OpenAI o3, o1 / Reasoning Models
