The Massive Multitask Language Understanding (MMLU) benchmark, introduced by Hendrycks and colleagues in 2020, measures how well a language model can answer factual and reasoning questions drawn from 57 academic and professional subjects. Topics span elementary mathematics, US history, computer science, law, medicine, philosophy, abstract algebra, microeconomics, and clinical knowledge. Questions are sourced from practice exams and other freely available materials (GRE, USMLE, Advanced Placement, professional licensing) and edited into a uniform four-option multiple-choice format.
The dataset contains roughly 15,908 questions, partitioned into a small few-shot development set (5 questions per subject), a validation set, and a held-out test set. Standard evaluation uses 5-shot prompting: the model sees five labelled examples from the same subject before answering each test question. Scoring is simple accuracy: the proportion of questions for which the model's most likely option matches the gold answer. A random baseline scores 25%; the authors estimate expert-level human accuracy at around 89.8% (the rate subject-matter experts achieve on questions in their own field), while unspecialized humans score far lower.
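A minimal sketch of this 5-shot setup and accuracy scoring, assuming a hypothetical `ask_model` callable that returns the model's chosen letter and illustrative `Question` records in the benchmark's four-option format:

```python
# Minimal sketch of 5-shot MMLU evaluation. `ask_model` is a hypothetical
# callable that returns the model's chosen letter ("A"-"D"); the Question
# records mirror the benchmark's four-option format but are illustrative.
from collections import namedtuple

Question = namedtuple("Question", ["question", "options", "answer"])  # answer: "A"-"D"
LETTERS = "ABCD"

def format_question(q, include_answer):
    lines = [q.question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(q.options)]
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(dev_examples, test_q, subject):
    # Five labelled examples from the same subject precede the unanswered test question.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(q, True) for q in dev_examples[:5])
    return header + shots + "\n\n" + format_question(test_q, False)

def accuracy(test_questions, dev_examples, subject, ask_model):
    # Simple accuracy: fraction of questions where the model's letter matches the gold answer.
    correct = sum(ask_model(build_prompt(dev_examples, q, subject)) == q.answer
                  for q in test_questions)
    return correct / len(test_questions)
```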
Performance trajectory. When MMLU was released, GPT-3 (175B) reached roughly 44%, only modestly above chance on hard subjects like college-level mathematics or professional law. Chinchilla (70B) reached 67.5% in 2022. GPT-4 reported 86.4% in March 2023, the first system to approach expert-level human accuracy averaged across all subjects. Claude 3.5 Sonnet reached 88.7% in mid-2024. Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all clustered between 86% and 89%. OpenAI o1 (the first reasoning model) crossed 92%, and o3 and Gemini 2.5 Pro approached 93–94% by late 2025, above the estimated 89.8% expert-level accuracy and with much of the remaining gap attributable to label noise and contested questions.
Known issues. MMLU is now widely regarded as saturated: the remaining errors are concentrated in a handful of contested or ambiguous questions, label noise, and minor format quirks. Several studies have documented training-set contamination: fragments of MMLU questions appear verbatim in Common Crawl and in Stack Exchange dumps that are routinely scraped into pretraining corpora. The benchmark is also notoriously format-sensitive: small changes to option order, prompt wording, or the position of the answer letter can shift accuracy by several points. These concerns motivated the creation of harder, less contaminated successors (MMLU-Pro, GPQA, Humanity's Last Exam).
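One way to probe the format sensitivity described above is to re-score each question with its options shuffled and compare accuracy. A sketch building on the helpers above, with `ask_model` again a hypothetical stand-in:

```python
# Illustrative option-order sensitivity check: shuffle each question's options
# (remapping the gold letter) and compare accuracy with the unshuffled run.
# Reuses the hypothetical Question/accuracy helpers from the sketch above.
import random

def shuffle_options(q, rng):
    order = list(range(len(q.options)))
    rng.shuffle(order)
    new_options = [q.options[i] for i in order]
    new_answer = LETTERS[order.index(LETTERS.index(q.answer))]
    return Question(q.question, new_options, new_answer)

def order_sensitivity(test_questions, dev_examples, subject, ask_model, seed=0):
    rng = random.Random(seed)
    shuffled = [shuffle_options(q, rng) for q in test_questions]
    base = accuracy(test_questions, dev_examples, subject, ask_model)
    perm = accuracy(shuffled, dev_examples, subject, ask_model)
    return base - perm  # a gap of several points signals format sensitivity
```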
Modern relevance. Despite saturation, MMLU remains a default reporting line on every model card because it is cheap to run, well-understood, and historically comparable. New 2025 frontier launches (Claude 4 Opus, GPT-5, Gemini 3) still quote MMLU scores, but headline reasoning claims now defer to GPQA Diamond, AIME, FrontierMath, and SWE-Bench Verified.
Reference: Hendrycks et al., "Measuring Massive Multitask Language Understanding", ICLR 2021.
Related terms: MMLU-Pro, GPQA, Humanity's Last Exam, GPT-3, Claude 4 Family
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics