MMMU (Massive Multi-discipline Multimodal Understanding), released by Yue and colleagues in 2024, is the multimodal answer to MMLU. It contains 11,500 college-level questions across 30 subjects and 6 disciplines (Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, Tech and Engineering). Every question pairs a natural-language stem with one or more images: charts, plots, anatomical diagrams, chemistry structures, music notation, engineering schematics, archaeological photographs, and so on. Models must integrate visual perception with subject-matter reasoning to answer.
The dataset covers 183 subfields and mixes multiple-choice and open-ended formats. The headline splits are MMMU validation (900 hand-vetted items) and MMMU test (10,500 items, leaderboard-only). A harder companion, MMMU-Pro (2024), filters out questions that text-only models can solve from the question stem alone, forcing genuine visual grounding, expands the candidate options from four to ten, and adds a vision-only variant in which the question text itself is rasterised into the image to defeat text-reading shortcuts.
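To make the item format concrete, here is a minimal loading-and-prompting sketch in Python. It assumes the dataset is hosted on the Hugging Face Hub as MMMU/MMMU with one config per subject and the field names shown (question, options as a stringified list, answer, image_1 through image_7); verify these against the dataset card before relying on them.

```python
# Minimal sketch: load one MMMU subject and format an item as a
# multiple-choice prompt. The dataset ID, config name, and field names
# are assumptions; check the MMMU dataset card.
import ast

from datasets import load_dataset

# Each of the 30 subjects is its own config; "validation" is the
# 900-item hand-vetted split described above.
ds = load_dataset("MMMU/MMMU", "Art", split="validation")

def build_prompt(example: dict) -> tuple[str, list]:
    """Turn one MMMU row into a prompt string plus its images."""
    # "options" is stored as a stringified Python list; an empty list
    # marks an open-ended (non-multiple-choice) item.
    options = ast.literal_eval(example["options"])
    letters = [chr(ord("A") + i) for i in range(len(options))]
    choices = "\n".join(f"({l}) {o}" for l, o in zip(letters, options))
    prompt = (
        f"{example['question']}\n{choices}\n"
        "Answer with the letter of the correct option."
    )
    # A question can reference up to seven images (image_1 .. image_7);
    # unused slots are None.
    images = [example[f"image_{i}"] for i in range(1, 8) if example.get(f"image_{i}")]
    return prompt, images

prompt, images = build_prompt(ds[0])
print(prompt)
print(f"{len(images)} image(s); gold answer: {ds[0]['answer']}")
```

The same loop is the natural place to compute validation accuracy: send the prompt and images to a vision-language model, extract the predicted letter, and compare it against the answer field.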
Performance trajectory. GPT-4V scored 56.8% on the MMMU validation set at release in late 2023. Claude 3 Opus reached 59.4%, Gemini 1.5 Pro 62.2%, and GPT-4o 69.1% at launch. Claude 3.5 Sonnet (new) hit 70.4% in October 2024, and OpenAI o1 crossed 78%. By late 2025, Gemini 2.5 Pro, Claude 4 Opus, and GPT-5 all cluster at 81–87% on validation. MMMU-Pro scores run roughly 10–15 points lower (Claude 4 Opus around 72%), making it the more discriminative variant. Domain experts score around 88.6%, so frontier models are now within reach of expert-human performance.
Known issues. Image quality varies (some scanned diagrams are barely legible). The vision-only MMMU-Pro variant has exposed that several "vision" benchmark items can be solved simply by reading text rendered into the image, a perception trick rather than visual reasoning. As with every public benchmark, training-data contamination is presumed for any post-2024 model.
Modern relevance. MMMU is the standard headline multimodal-reasoning benchmark in 2024–2026 model launches (GPT-5, Claude 4, Gemini 2.5, Llama 4 Vision, Qwen2.5-VL). It plays the role for vision-language models that MMLU plays for text-only models, and it is gradually being supplemented by MathVista, ChartQA, and the multimodal track of Humanity's Last Exam.
Reference: Yue et al., "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark", CVPR 2024.
Related terms: MMLU, ImageNet (ILSVRC), Humanity's Last Exam
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics