15.18 Evaluation in 2026
Benchmarks that defined the 2020–2024 era are saturated. New benchmarks have emerged. Contamination, the leakage of test sets into training data, has become a first-order concern.
The saturation of MMLU and friends
MMLU (Hendrycks et al., 2021), 57 subjects, multiple choice. Human expert: ~89%. By 2024, frontier models exceeded 90%; by 2025, the top of the pack was over 92%. The benchmark no longer discriminates between strong models.
MMLU-Pro (Wang et al., 2024) replaced multiple-choice with 10-option questions and harder distractors. Top models in early 2026 reach the low 80s, providing some discrimination but already showing signs of saturation.
GSM8K, grade-school math word problems. Solved (above 95%) by frontier models since 2023.
HellaSwag, ARC, WinoGrande, TruthfulQA, all saturated or near-saturated.
The new hard benchmarks
- GPQA (Rein et al., 2023), Google-Proof Q&A. PhD-level questions in physics, chemistry, and biology, designed so that non-experts with internet access cannot solve them. Frontier models in early 2026 reach the high 70s to low 80s; PhDs in the same field score around 65%.
- AIME, American Invitational Mathematics Examination problems. In 2023, a frontier model could reliably solve perhaps 10% of problems. By 2025, o1, o3, and R1 were solving 80–90%. The benchmark is largely saturated.
- FrontierMath (Glazer et al., 2024), research-level mathematics problems, deliberately constructed to require minutes to hours of expert effort. As of April 2026, frontier reasoning models score around 50% on Tier 1–3 problems and 30–40% on Tier 4 problems (GPT-5.5 Pro: 52.4% on Tier 1–3, 39.6% on Tier 4; Opus 4.7: 22.9% on Tier 4).
- ARC-AGI (Chollet, 2019) and ARC-AGI-2, abstract reasoning grids. ARC-AGI-1 was substantially solved by o3 in late 2024 (frontier > 85%); ARC-AGI-2, harder, stands at 30–40% in early 2026.
- Humanity's Last Exam (HLE, 2025), 3000+ questions across the breadth of human academic disciplines, many with closed-form correct answers. Designed explicitly to be the last frontier benchmark. Top models in early 2026 score in the high 20s (percent); humans average single digits.
- SWE-Bench and SWE-Bench Verified, real GitHub issues, with the model expected to produce a patch that passes the repository's tests (a minimal harness sketch follows this list). Top systems in April 2026 reach 80–85% on the Verified subset (GPT-5.5 ~83%, Opus 4.7 ~82%), though Verified is now widely considered contaminated and OpenAI now reports SWE-Bench Pro instead.
- OSWorld, VisualWebArena, TheAgentCompany, long-horizon agentic benchmarks. Performance remains in the 30–60% range; this is the active growth frontier.
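To make the SWE-Bench-style protocol concrete, here is a minimal sketch of a pass/fail harness, assuming a checked-out repository, a model-generated unified diff, and a test command. The function name and the `pytest` invocation are illustrative placeholders, not the official harness.

```python
import subprocess
from pathlib import Path

def resolves_issue(repo_dir: Path, patch: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and report whether the repo's tests pass.
    A simplified stand-in for a SWE-Bench-style harness, not the official one."""
    # Apply the unified diff produced by the model (read from stdin via "-").
    apply = subprocess.run(
        ["git", "apply", "-"], input=patch, text=True,
        cwd=repo_dir, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly
    # Run the project's test suite; the issue counts as resolved only if it passes.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Illustrative usage (paths and command are placeholders):
# ok = resolves_issue(Path("/tmp/repo"), model_patch, ["pytest", "-x", "tests/"])
```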
Contamination
The contamination crisis came to a head in 2024. Many widely cited results turned out to be contaminated: test problems appeared verbatim or paraphrased in pre-training data. The most damning evidence was that models could often reproduce test questions verbatim when prompted with a partial question, indicating they had seen the test set during training.
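A common probe for this kind of leakage, sketched below under the assumption of a simple text-completion interface, splits each test question in half, asks the model to continue from the first half, and measures how much of the held-back second half it reproduces; high overlap across many questions suggests the benchmark was in the training data. The `complete` argument is a placeholder for whatever model API is available, and the 8-token window is an arbitrary illustrative choice.

```python
def ngram_overlap(candidate: str, reference: str, n: int = 8) -> float:
    """Fraction of the reference's n-grams (over whitespace tokens) that the
    candidate reproduces. High values on text the model should not know are suspicious."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ref = ngrams(reference)
    return len(ref & ngrams(candidate)) / len(ref) if ref else 0.0

def contamination_score(questions: list[str], complete) -> float:
    """Average verbatim-continuation overlap across a benchmark's questions.
    `complete(prefix)` is a placeholder for a call to the model under test."""
    scores = []
    for q in questions:
        half = len(q) // 2
        continuation = complete(q[:half])          # model continues the question
        scores.append(ngram_overlap(continuation, q[half:]))
    return sum(scores) / len(scores)
```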
Mitigations:
- Held-out construction: benchmarks created after a model's training cut-off cannot be contaminated by it. This is what gives FrontierMath and HLE much of their value.
- Decontamination at training time: filter pre-training and SFT data against known test sets (a minimal n-gram filter is sketched after this list). The technique can only be applied to benchmarks that exist when training begins; benchmarks released after the training cut-off are protected by held-out construction (the previous bullet) rather than by decontamination.
- Private benchmarks: hold the test set behind an API and allow only black-box queries. Several major benchmarks now offer this mode.
- Dynamic benchmarks: regenerate problems on each evaluation. LiveCodeBench, MathVista, SWE-Bench Live.
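As a concrete illustration of the decontamination bullet above, here is a minimal sketch of n-gram filtering of training documents against known test sets. The whitespace tokenisation, the 13-token window, and the "drop on any hit" policy are assumptions for the sketch; real pipelines differ in tokeniser, window length, and whether matching spans are dropped or merely flagged.

```python
def token_ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All n-grams over whitespace tokens; real pipelines use proper tokenisers."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_blocklist(test_sets: list[list[str]], n: int = 13) -> set[tuple[str, ...]]:
    """Union of n-grams from every known benchmark's test examples."""
    blocked: set[tuple[str, ...]] = set()
    for examples in test_sets:
        for example in examples:
            blocked |= token_ngrams(example, n)
    return blocked

def is_contaminated(document: str, blocklist: set[tuple[str, ...]], n: int = 13) -> bool:
    """Flag a training document that shares any long n-gram with a known test set."""
    return not token_ngrams(document, n).isdisjoint(blocklist)

# Usage sketch: filter a corpus before pre-training or SFT.
# clean_corpus = [doc for doc in corpus if not is_contaminated(doc, blocklist)]
```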
By 2026, single-number benchmark scores are widely distrusted. Publication norms have shifted toward reporting on a portfolio of benchmarks, including private and dynamic ones, with explicit contamination analysis.
What saturates tells you
A subtle point. When a benchmark saturates, the simple interpretation is "models have become as capable as the test requires." But saturation can also be a property of:
- the headroom above the expert ceiling: MMLU's expert ceiling is ~89%, so models exceeding 90% have technically beaten experts, but the remaining headroom on the metric is small;
- the noise in the gold answers: many MMLU questions have ambiguous or arguably wrong "correct" answers, so accuracy above ~95% is not meaningful;
- contamination: saturation can be partly an artefact of test data leaking into training data over years of public availability.
A useful rule of thumb: a benchmark saturates around the noise floor of its annotations. If the inter-annotator agreement is 92%, no model will reliably score above 92%; the distinction between 92.5% and 94.0% is noise.
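To make the rule of thumb concrete, a back-of-the-envelope sketch with purely illustrative numbers: if a fraction of gold labels are wrong, even a perfect model's expected score is capped below 100%, and near that ceiling the gap between two models is often comparable to ordinary sampling noise.

```python
import math

def noise_ceiling(label_error_rate: float) -> float:
    """Expected score of a perfect model when a fraction of gold labels are wrong
    (assuming a correct answer never happens to match a wrong label)."""
    return 1.0 - label_error_rate

def score_std_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of a measured benchmark accuracy."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# Illustrative numbers, not measurements: 8% label noise, 1,000 questions.
ceiling = noise_ceiling(0.08)      # ~0.92: roughly where scores pile up
se = score_std_error(0.93, 1_000)  # ~0.008 per model
print(f"ceiling ~{ceiling:.2f}, 95% CI half-width ~{1.96 * se:.3f}")
# A 1.5-point gap between two models near the ceiling is comparable to this
# half-width, i.e. hard to distinguish from noise at this test-set size.
```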
Beyond benchmarks
The next frontier in evaluation is real-world impact: does the model actually help users do their work? Some efforts:
- Chatbot Arena (LMSys), pairwise human voting on model outputs. Now the most-cited single number in industry, despite its flaws (it measures stylistic preference more than capability); a minimal rating sketch follows this list.
- Real-world deployments with embedded telemetry: Cursor, Devin, and Claude Code measure success rates on actual user tasks.
- Capability evaluations for AI safety (METR, AISI, Apollo): measure dangerous capabilities (autonomous self-replication, persuasion, cyberoffense) on standardised task suites.
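To show how a leaderboard number falls out of pairwise votes, here is a minimal sequential Elo-style update over (winner, loser) pairs. Chatbot Arena fits a Bradley-Terry model over all votes jointly, so this online version is only a simplified sketch, and the K-factor, base rating, and model names are assumptions.

```python
from collections import defaultdict

def expected_win_prob(r_a: float, r_b: float) -> float:
    """Probability that A beats B under a logistic (Elo/Bradley-Terry) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rate_from_votes(votes: list[tuple[str, str]], k: float = 4.0) -> dict[str, float]:
    """Sequential Elo updates from (winner, loser) human votes.
    A simplified stand-in for Arena's joint Bradley-Terry fit."""
    ratings: dict[str, float] = defaultdict(lambda: 1000.0)
    for winner, loser in votes:
        p_win = expected_win_prob(ratings[winner], ratings[loser])
        ratings[winner] += k * (1.0 - p_win)   # winner gains the "unexpected" share
        ratings[loser] -= k * (1.0 - p_win)    # loser loses the same amount
    return dict(ratings)

# Illustrative usage with made-up model names and votes:
print(rate_from_votes([("model_a", "model_b"), ("model_a", "model_c"),
                       ("model_b", "model_c")]))
```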
We expect the trend to continue: evaluations will become more grounded in actual deployment, less reliant on closed-book exam formats.