Glossary

MMLU-Pro

MMLU-Pro was released by TIGER-Lab (University of Waterloo) in 2024 as a direct response to the saturation and contamination of the original MMLU. It rebuilds the multiple-choice format with three substantive changes: ten answer options instead of four (reducing the random baseline from 25% to 10% and making lucky guessing less helpful), more reasoning-heavy questions sourced from textbooks and STEM examinations rather than recycled web content, and aggressive filtering of items that frontier models could solve without thinking.
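To make the guessing arithmetic concrete, here is a back-of-the-envelope sketch (not from the paper) of expected accuracy when a model genuinely knows some fraction of the questions and guesses uniformly at random on the rest:

```python
# Back-of-the-envelope effect of 4 vs 10 answer options on guess-inflated accuracy.
# Simplifying assumption: the model either knows an item or guesses uniformly;
# real models guess better than chance, so this only illustrates the direction.
def expected_accuracy(known: float, num_options: int) -> float:
    return known + (1.0 - known) / num_options

for known in (0.5, 0.7, 0.9):
    mmlu_style = expected_accuracy(known, 4)   # original MMLU: 4 options
    pro_style = expected_accuracy(known, 10)   # MMLU-Pro: 10 options
    print(f"knows {known:.0%} -> 4-option score {mmlu_style:.1%}, "
          f"10-option score {pro_style:.1%}")
```

A model that truly knows 70% of the items would score about 77.5% with four options but only 73% with ten, so the extra options shrink the cushion that random guessing provides.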

The dataset contains around 12,000 questions across 14 disciplines (biology, business, chemistry, computer science, economics, engineering, health, history, law, mathematics, philosophy, physics, psychology, and other). Questions are vetted by domain experts and stress chain-of-thought reasoning rather than recall. Scoring uses standard accuracy under a 5-shot or zero-shot CoT prompt.
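A minimal scoring harness might look like the sketch below. It assumes the dataset is published on the Hugging Face Hub as TIGER-Lab/MMLU-Pro with question, options, and answer fields, and uses a placeholder ask_model function standing in for whatever model or API is under test; field names and the prompt format should be checked against the official TIGER-Lab evaluation code before relying on the numbers.

```python
"""Minimal zero-shot CoT accuracy sketch for MMLU-Pro.

Assumptions (verify against the official TIGER-Lab evaluation code):
- the dataset is on the Hugging Face Hub as "TIGER-Lab/MMLU-Pro" with
  "question", "options" (list of strings), and "answer" (letter) fields;
- `ask_model` is a placeholder for the model/API being evaluated.
"""
import re
import string
from datasets import load_dataset  # pip install datasets


def ask_model(prompt: str) -> str:
    """Placeholder: replace with a call to the model under evaluation."""
    raise NotImplementedError


def build_prompt(question: str, options: list[str]) -> str:
    letters = string.ascii_uppercase
    option_lines = [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    return (
        f"Question: {question}\n"
        + "\n".join(option_lines)
        + "\nThink step by step, then finish with 'The answer is (X)'."
    )


def extract_answer(completion: str) -> str | None:
    """Pull the final letter choice (A-J) out of the model's reasoning."""
    match = re.search(r"answer is \(?([A-J])\)?", completion)
    return match.group(1) if match else None


def evaluate(split: str = "test") -> float:
    data = load_dataset("TIGER-Lab/MMLU-Pro", split=split)
    correct = 0
    for row in data:
        completion = ask_model(build_prompt(row["question"], row["options"]))
        if extract_answer(completion) == row["answer"]:
            correct += 1
    return correct / len(data)
```

A 5-shot CoT run differs only in prepending a few worked examples (with reasoning) from the validation split to each prompt.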

Performance trajectory. Initial 2024 evaluations placed GPT-4o at roughly 72.6%, Claude 3.5 Sonnet at 76.1%, and open-weights models well behind (Llama 3.1 405B around 73%). By mid-2025, OpenAI o1 reached around 83%, o3 and Gemini 2.5 Pro crossed 85%, and Claude 4 Opus clustered in the same band. The gap between MMLU and MMLU-Pro scores has narrowed as reasoning models have matured, but it remains a useful headroom indicator: a model that scores 92% on MMLU but only 78% on MMLU-Pro is leaning on shallow recall.

Known issues. MMLU-Pro is harder, not contamination-proof. Evidence of partial leakage has surfaced for some textbook-sourced items. The benchmark is also still purely multiple-choice, so it cannot evaluate open-ended generation, calibration of refusals, or long-form reasoning. Some critics argue that the ten-option format relies on distractors that are sometimes nearly correct, penalising models for choosing a defensible but non-keyed option.

Modern relevance. MMLU-Pro is now a standard line on frontier model cards (OpenAI, Anthropic, Google DeepMind, Meta) and is widely used by LiveBench and academic comparison studies as a less-saturated drop-in replacement for MMLU. It is the recommended default for any 2025–2026 model evaluation that wants a broad academic-knowledge headline number.

Reference: Wang et al., "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark", NeurIPS 2024 Datasets and Benchmarks.

Related terms: MMLU, GPQA, LiveBench, o1 / Reasoning Models
