15.18 Evaluation in 2026
Benchmarks that defined the 2020–2024 era are saturated. New benchmarks have emerged. Contamination, the leakage of test sets into training data, has become a first-order concern.
The saturation of MMLU and friends
MMLU (Hendrycks et al., 2021), 57 subjects, multiple choice. Human expert: ~89%. By 2024, frontier models exceeded 90%; by 2025, the top of the pack was over 92%. The benchmark no longer discriminates between strong models.
MMLU-Pro (Wang et al., 2024) replaced multiple-choice with 10-option questions and harder distractors. Top models in early 2026 reach the low 80s, providing some discrimination but already showing signs of saturation.
GSM8K, grade-school math word problems. Solved (above 95%) by frontier models since 2023.
HellaSwag, ARC, WinoGrande, TruthfulQA, all saturated or near-saturated.
The new hard benchmarks
- GPQA (Rein et al., 2023), Google-Proof Q&A. PhD-level questions in physics, chemistry, and biology, designed so that non-experts with internet access cannot solve them. Frontier models in early 2026 reach the high 70s to low 80s; PhDs in the same field score around 65%.
- AIME, American Invitational Mathematics Examination problems. In 2023, a frontier model could reliably solve perhaps 10% of problems. By 2025, o1, o3, and R1 were solving 80–90%. The benchmark is largely saturated.
- FrontierMath (Glazer et al., 2024), research-level mathematics problems, deliberately constructed to require minutes to hours of expert effort. As of April 2026, frontier reasoning models score around 50% on Tier 1–3 problems and 30–40% on Tier 4 problems (GPT-5.5 Pro: 52.4% on Tier 1–3, 39.6% on Tier 4; Opus 4.7: 22.9% on Tier 4).
- ARC-AGI (Chollet, 2019) and ARC-AGI-2, abstract reasoning grids. ARC-AGI-1 was substantially solved by o3 in late 2024 (frontier > 85%); ARC-AGI-2, harder, stands at 30–40% in early 2026.
- Humanity's Last Exam (HLE, 2025), 3000+ questions across the breadth of human academic disciplines, many with closed-form correct answers. Designed explicitly to be the last frontier benchmark. Top models in early 2026 score in the high 20s (percent); humans average single digits.
- SWE-Bench and SWE-Bench Verified, real GitHub issues, with the model expected to produce a patch that passes the repository's tests (a minimal harness sketch follows this list). Top systems in April 2026 reach 80–85% on the Verified subset (GPT-5.5 ~83%, Opus 4.7 ~82%), though Verified is now widely considered contaminated and OpenAI now reports SWE-Bench Pro instead.
- OSWorld, VisualWebArena, TheAgentCompany, long-horizon agentic benchmarks. Performance remains in the 30–60% range; this is the active growth frontier.
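To make the SWE-Bench-style protocol concrete, here is a minimal sketch of a pass/fail harness, assuming a checked-out repository, a model-generated unified diff, and a test command. The function name and the `pytest` invocation are illustrative placeholders, not the official harness.

```python
import subprocess
from pathlib import Path

def resolves_issue(repo_dir: Path, patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and report whether the repo's tests pass.
    A simplified stand-in for a SWE-Bench-style harness, not the official one."""
    # Apply the unified diff produced by the model (read from stdin via "-").
    apply = subprocess.run(
        ["git", "apply", "-"], input=patch, text=True,
        cwd=repo_dir, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly
    # Run the project's test suite; the issue counts as resolved only if it passes.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Illustrative usage (paths and command are placeholders):
# ok = resolves_issue(Path("/tmp/repo"), model_patch, ["pytest", "-x", "tests/"])
```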
Contamination
The contamination crisis came to a head in 2024. Many widely cited results turned out to be contaminated: test problems appeared verbatim or paraphrased in pre-training data. The most damning evidence was that models could often reproduce test questions verbatim when prompted with a partial question, indicating they had seen the test set during training.
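A common probe for this kind of leakage, sketched below under the assumption of a simple text-completion interface, splits each test question in half, asks the model to continue from the first half, and measures how much of the held-back second half it reproduces; high overlap across many questions suggests the benchmark was in the training data. The `complete` argument is a placeholder for whatever model API is available, and the 8-token window is an arbitrary illustrative choice.

```python
def ngram_overlap(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the reference's n-grams (over whitespace tokens) that the
    candidate reproduces. High values on text the model should not know are suspicious."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ref = ngrams(reference)
    return len(ref & ngrams(candidate)) / len(ref) if ref else 0.0

def contamination_score(questions: list[str], complete) -> float:
    """Average verbatim-continuation overlap across a benchmark's questions.
    `complete(prefix)` is a placeholder for a call to the model under test."""
    scores = []
    for q in questions:
        half = len(q) // 2
        continuation = complete(q[:half])          # model continues the question
        scores.append(ngram_overlap(continuation, q[half:]))
    return sum(scores) / len(scores)
```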
Mitigations:
- Held-out construction: benchmarks created after a model's training cut-off cannot be contaminated by it. This is what gives FrontierMath and HLE much of their value.
- Decontamination at training time: filter pre-training and SFT data against known test sets (a minimal n-gram filter is sketched after this list). The technique can only be applied to benchmarks that exist when training begins; benchmarks released after the training cut-off are protected by held-out construction (the previous bullet) rather than by decontamination.
- Private benchmarks: hold the test set behind an API and allow only black-box queries. Several major benchmarks now offer this mode.
- Dynamic benchmarks: regenerate problems on each evaluation. LiveCodeBench, MathVista, SWE-Bench Live.
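As a concrete illustration of the decontamination bullet above, here is a minimal sketch of n-gram filtering of training documents against known test sets. The whitespace tokenisation, the 13-token window, and the "drop on any hit" policy are assumptions for the sketch; real pipelines differ in tokeniser, window length, and whether matching spans are dropped or merely flagged.

```python
def token_ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All n-grams over whitespace tokens; real pipelines use proper tokenisers."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_blocklist(test_sets: list[list[str]], n: int = 13) -> set[tuple[str, ...]]:
    """Union of n-grams from every known benchmark's test examples."""
    blocked: set[tuple[str, ...]] = set()
    for examples in test_sets:
        for example in examples:
            blocked |= token_ngrams(example, n)
    return blocked

def is_contaminated(document: str, blocklist: set[tuple[str, ...]], n: int = 13) -> bool:
    """Flag a training document that shares any long n-gram with a known test set."""
    return not token_ngrams(document, n).isdisjoint(blocklist)

# Usage sketch: filter a corpus before pre-training or SFT.
# clean_corpus = [doc for doc in corpus if not is_contaminated(doc, blocklist)]
```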
By 2026, single-number benchmark scores are widely distrusted. Publication norms have shifted toward reporting on a portfolio of benchmarks, including private and dynamic ones, with explicit contamination analysis.
What saturates tells you
A subtle point. When a benchmark saturates, the simple interpretation is "models have become as capable as the test requires." But saturation can also be a property of:
- the headroom above the expert ceiling: MMLU's expert ceiling is ~89%, so models exceeding 90% have technically beaten experts, but the remaining headroom on the metric is small;
- the noise in the gold answers: many MMLU questions have ambiguous or arguably wrong "correct" answers, so accuracy above ~95% is not meaningful;
- contamination: saturation can be partly an artefact of test data leaking into training data over years of public availability.
A useful rule of thumb: a benchmark saturates around the noise floor of its annotations. If the inter-annotator agreement is 92%, no model will reliably score above 92%; the distinction between 92.5% and 94.0% is noise.
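To make the rule of thumb concrete, a back-of-the-envelope sketch with purely illustrative numbers: if a fraction of gold labels are wrong, even a perfect model's expected score is capped below 100%, and near that ceiling the gap between two models is often comparable to ordinary sampling noise.

```python
import math

def noise_ceiling(label_error_rate: float) -> float:
    """Expected score of a perfect model when a fraction of gold labels are wrong
    (assuming a correct answer never happens to match a wrong label)."""
    return 1.0 - label_error_rate

def score_std_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of a measured benchmark accuracy."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# Illustrative numbers, not measurements: 8% label noise, 1,000 questions.
ceiling = noise_ceiling(0.08)      # ~0.92: roughly where scores pile up
se = score_std_error(0.93, 1_000)  # ~0.008 per model
print(f"ceiling ~{ceiling:.2f}, 95% CI half-width ~{1.96 * se:.3f}")
# A 1.5-point gap between two models near the ceiling is comparable to this
# half-width, i.e. hard to distinguish from noise at this test-set size.
```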
Beyond benchmarks
The next frontier in evaluation is real-world impact: does the model actually help users do their work? Some efforts:
- Chatbot Arena (LMSys), pairwise human voting on model outputs. Now the most-cited single number in industry, despite its flaws (it measures stylistic preference more than capability); a minimal rating sketch follows this list.
- Real-world deployments with embedded telemetry: Cursor, Devin, and Claude Code measure success rates on actual user tasks.
- Capability evaluations for AI safety (METR, AISI, Apollo): measure dangerous capabilities (autonomous self-replication, persuasion, cyberoffense) on standardised task suites.
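To show how a leaderboard number falls out of pairwise votes, here is a minimal sequential Elo-style update over (winner, loser) pairs. Chatbot Arena fits a Bradley-Terry model over all votes jointly, so this online version is only a simplified sketch, and the K-factor, base rating, and model names are assumptions.

```python
from collections import defaultdict

def expected_win_prob(r_a: float, r_b: float) -> float:
    """Probability that A beats B under a logistic (Elo/Bradley-Terry) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rate_from_votes(votes: list[tuple[str, str]], k: float = 4.0) -> dict[str, float]:
    """Sequential Elo updates from (winner, loser) human votes.
    A simplified stand-in for Arena's joint Bradley-Terry fit."""
    ratings: dict[str, float] = defaultdict(lambda: 1000.0)
    for winner, loser in votes:
        p_win = expected_win_prob(ratings[winner], ratings[loser])
        ratings[winner] += k * (1.0 - p_win)   # winner gains the "unexpected" share
        ratings[loser] -= k * (1.0 - p_win)    # loser loses the same amount
    return dict(ratings)

# Illustrative usage with made-up model names and votes:
print(rate_from_votes([("model_a", "model_b"), ("model_a", "model_c"),
                       ("model_b", "model_c")]))
```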
We expect the trend to continue: evaluations will become more grounded in actual deployment, less reliant on closed-book exam formats.