Glossary

MATH

The MATH benchmark, released by Hendrycks and colleagues in 2021, contains 12,500 problems drawn from US high-school mathematics competitions including AMC 10, AMC 12, and AIME. Topics span seven categories: prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Each problem is rated difficulty 1 (easiest) to 5 (hardest) and ships with a written, step-by-step worked solution. Answers are typically integers, simple fractions, or short closed-form expressions, written in LaTeX inside a \boxed{} macro.
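For illustration, each item in the public release is a small JSON record; the Python sketch below shows roughly that shape (the field names follow the released files, but the problem and solution text here are invented stand-ins, not a real dataset entry):

    # Illustrative shape of a single MATH item (field names as in the
    # released JSON files; problem and solution text are invented
    # stand-ins, not a real entry).
    example_item = {
        "problem": "What is $1 + 2 + \\cdots + 10$?",
        "level": "Level 1",
        "type": "Algebra",
        "solution": (
            "The sum of the first $n$ positive integers is $\\frac{n(n+1)}{2}$, "
            "so the total is $\\frac{10 \\cdot 11}{2} = \\boxed{55}$."
        ),
    }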

The standard split is 7,500 training problems and 5,000 test problems. Evaluation uses chain-of-thought generation followed by automated extraction of the final boxed expression and a string-equivalence check against the gold answer (with normalisation for whitespace, equivalent fractions, and ordered answer pairs). Scoring is exact-match accuracy. In the original paper's informal human evaluation, a computer-science PhD student who did not especially like mathematics scored roughly 40%, while a three-time IMO gold medallist scored around 90%.
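A minimal sketch of the boxed-answer extraction and exact-match check is shown below, assuming only a light string normalisation (widely used harnesses add more elaborate LaTeX-aware or sympy-based equivalence checks; all function names here are illustrative):

    import re

    def extract_boxed(text):
        """Return the contents of the last \\boxed{...} in a completion,
        scanning character by character to handle nested braces."""
        start = text.rfind("\\boxed{")
        if start == -1:
            return None
        i = start + len("\\boxed{")
        depth = 1
        out = []
        while i < len(text) and depth > 0:
            ch = text[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            out.append(ch)
            i += 1
        return "".join(out)

    def normalise(ans):
        """Very light normalisation: drop whitespace and \\left/\\right."""
        ans = ans.replace("\\left", "").replace("\\right", "")
        return re.sub(r"\s+", "", ans)

    def is_correct(completion, gold_boxed_answer):
        pred = extract_boxed(completion)
        return pred is not None and normalise(pred) == normalise(gold_boxed_answer)

    # Example: a chain-of-thought completion ending in a boxed answer.
    print(is_correct("... so the total is \\boxed{\\frac{7}{2}}.", "\\frac{7}{2}"))  # True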

Performance trajectory. The models evaluated in the 2021 paper, including GPT-3 175B, scored below 7%. Minerva 540B (a math-specialised PaLM variant) reached 50.3% in 2022. GPT-4 crossed 52.9% at release. By mid-2024, Claude 3.5 Sonnet reached 71%, Llama 3.1 405B 73.8%, and GPT-4o 76.6%. The reasoning-model jump was dramatic: OpenAI o1 reached 94.8% and DeepSeek-R1 97.3% (commonly reported on the MATH-500 subset), with later frontier models such as o3 and Gemini 2.5 Pro at or above that level. The benchmark is now considered largely saturated at the frontier, and most leaderboards have shifted to AIME 2024/2025 and FrontierMath.

Known issues. As with GSM8K, contamination is a major concern: competition problems and their official solutions are widely indexed by Google and have appeared in many web-scraped corpora since 2021. The Minerva paper itself documented several leaked items. Format sensitivity also matters: switching the boxed-answer convention or the LaTeX flavour can drop scores by several points on weaker models.

Modern relevance. MATH remains a useful stratified benchmark (the difficulty 5 subset still discriminates), but it is no longer a frontier benchmark. New 2025 model launches report MATH for backwards compatibility while leading with AIME and FrontierMath.

Reference: Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", NeurIPS Datasets and Benchmarks 2021.

Related terms: GSM8K, AIME, FrontierMath, o1 / Reasoning Models
