Glossary

AIME

The American Invitational Mathematics Examination is a 3-hour, 15-question competition for roughly the top 5% of US high-school mathematicians; two papers (AIME I and AIME II) are set each year, for 30 questions in total. Each answer is an integer between 000 and 999, which eliminates multiple-choice luck. Problems are deliberately designed to require creative insight rather than mechanical computation: number-theoretic identities, clever counting, geometric constructions, and short proofs reduced to a numerical answer.

From early 2024 onwards, the AI community adopted AIME 2024 (15 problems) and AIME 2025 (15 problems) as frontier reasoning benchmarks. The appeal is threefold: the problems are released after most pretraining cutoffs (so contamination is initially low), the integer-answer format is unambiguous, and a score out of 30 (across both papers) is easy to communicate. Standard reporting is pass@1 under chain-of-thought, sometimes supplemented with majority-of-N voting (cons@64) for stronger models.
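
A minimal sketch of the two reporting modes (the names below, such as samples and gold, are illustrative rather than taken from any particular evaluation harness): pass@1 scores a single sampled answer per problem by exact integer match, while cons@N takes the most common answer across N samples before matching.

```python
from collections import Counter

def pass_at_1(samples: dict[int, list[int]], gold: dict[int, int]) -> float:
    """Score the first sampled answer per problem by exact integer match."""
    correct = sum(samples[q][0] == ans for q, ans in gold.items())
    return correct / len(gold)

def cons_at_n(samples: dict[int, list[int]], gold: dict[int, int], n: int = 64) -> float:
    """Majority-vote over the first n samples per problem, then exact match."""
    correct = 0
    for q, ans in gold.items():
        majority, _ = Counter(samples[q][:n]).most_common(1)[0]
        correct += int(majority == ans)
    return correct / len(gold)

# Illustrative (hypothetical) data: 2 problems, 3 samples each.
gold = {1: 204, 2: 73}
samples = {1: [204, 110, 204], 2: [73, 45, 73]}
print(pass_at_1(samples, gold))       # 1.0
print(cons_at_n(samples, gold, n=3))  # 1.0
```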

Performance trajectory. When OpenAI announced o1 in September 2024, it scored 74.4% on AIME 2024 (against GPT-4o's 13.4%), the largest single-benchmark jump in modern LLM history. DeepSeek-R1 (January 2025) reached 79.8%, matching the released o1. OpenAI o3, announced in December 2024, reached 96.7%. Gemini 2.5 Pro Deep Think crossed 88% in mid-2025. Grok 4 and Claude 4.5 Sonnet (extended thinking) reported 95–97% by late 2025. Elite human competitors (AIME qualifiers) average 5–7 out of 15, so frontier models now substantially exceed top high-school competition performance.

Known issues. AIME problems and solutions are aggressively re-shared online within days of the exam, so by mid-2025 the 2024 paper was considered partially contaminated for any model trained after that point. AIME 2025 (released February 2025) became the cleaner benchmark for the rest of that year. Some labs have been criticised for fine-tuning on past AIME problems and inflating scores. The 30-problem total also makes the metric noisy: a single-problem swing is worth 3.3 percentage points.
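
A back-of-the-envelope sketch of that noise, assuming each problem is an independent Bernoulli trial (a simplification, since real AIME problems vary in difficulty): the per-problem swing is 100/30 ≈ 3.3 points, and the binomial standard error at a 90% score is roughly 5.5 points.

```python
import math

N_PROBLEMS = 30  # AIME 2024 + AIME 2025 problems reported together

def per_problem_swing() -> float:
    """Percentage points gained or lost by a single problem."""
    return 100.0 / N_PROBLEMS

def standard_error(score_pct: float) -> float:
    """Binomial standard error (in percentage points) of an observed score,
    treating each of the 30 problems as an independent Bernoulli trial."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / N_PROBLEMS)

print(f"Swing per problem: {per_problem_swing():.1f} pp")   # 3.3 pp
print(f"SE at a 90% score: {standard_error(90.0):.1f} pp")  # ~5.5 pp
```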

Modern relevance. AIME is now the most-cited mathematical reasoning benchmark on frontier model launches (OpenAI, Anthropic, Google, xAI, DeepSeek, Qwen). It was the headline number that defined the reasoning-model era.

Reference: AIME problems and solutions are published annually by the Mathematical Association of America (MAA).

Related terms: MATH, FrontierMath, OpenAI o3, DeepSeek R1-Zero, o1 / Reasoning Models, Chain-of-Thought
