Glossary

Humanity's Last Exam

Humanity's Last Exam (HLE), introduced by the Center for AI Safety (CAIS) and Scale AI in January 2025, is a cross-disciplinary frontier benchmark designed as the successor to MMLU and MMLU-Pro. The dataset contains roughly 3,000 expert-written questions across dozens of academic and professional domains (mathematics, physics, chemistry, biology, medicine, law, history, classical languages, philosophy, computer science, olympiad mathematics, ancient texts, music theory, and more). Questions are deliberately set at graduate level or beyond, and many require integration across multiple specialties.

The benchmark was crowd-sourced from over 500 contributors (academic faculty, PhD students, and licensed professionals) responding to a public call for questions that frontier models could not solve. Each question was vetted by two independent experts. Some questions involve images (chemistry structures, physics diagrams, medical imaging); the benchmark therefore has a multimodal track in addition to its text-only track.

Scoring uses exact match for closed-form numeric or short-string answers and a GPT-4-based grader for open-ended questions. The benchmark also reports calibration error alongside accuracy, a deliberate effort to penalise overconfident wrong answers.
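The calibration metric can be sketched as a binned expected calibration error (ECE) over the confidences models report with their answers. The 10-bin scheme below is an illustrative assumption, not HLE's exact protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: bools. Returns ECE in [0, 1].

    Illustrative binned ECE, not HLE's exact grading protocol: each bin
    contributes |mean confidence - accuracy| weighted by its share of items,
    so overconfident wrong answers inflate the score.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# A model that says "90% confident" on everything but is right only half
# the time is badly calibrated (ECE ~ 0.4), even at 50% accuracy.
print(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5))
```

A perfectly calibrated model (confidence matching accuracy in every bin) scores 0; early HLE results showed high calibration error alongside low accuracy, i.e. models were confidently wrong.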

Performance trajectory. At release in January 2025, GPT-4o scored 3.3%, Claude 3.5 Sonnet (new) 4.3%, Gemini 1.5 Pro 2.4%, OpenAI o1 9.1%, and DeepSeek-R1 8.5%. OpenAI o3 scored 20%. By late 2025, Gemini 2.5 Pro Deep Think reached 27%, Grok 4 31%, Claude 4.5 Sonnet 30%, and GPT-5 with high test-time compute crossed 35%. The benchmark was deliberately designed so that scores would remain below 50% through at least 2026; current trajectories suggest crossing 50% by late 2026 or 2027.

Known issues. Like FrontierMath, HLE has been criticised for the involvement of frontier-lab funding (Scale AI has commercial relationships with all major labs). Some questions have proven ambiguous in expert review and are periodically retired. The 3,000-item size means a single question is worth only ~0.03 percentage points, while the heterogeneous domain mix means the headline macro-average obscures wide per-domain variation.
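The arithmetic behind that sensitivity figure, and the way an averaged score can hide per-domain spread, can be illustrated with invented numbers (the domain counts and accuracies below are hypothetical, not HLE's actual breakdown):

```python
# One question out of ~3,000 moves overall accuracy by 100/3000,
# i.e. ~0.033 percentage points.
total_questions = 3000
swing_pp = 100 / total_questions
print(f"per-question swing: {swing_pp:.3f} percentage points")  # 0.033

# Hypothetical per-domain results (counts and accuracies invented for
# illustration; not HLE's real domain breakdown).
domains = {"maths": (800, 0.45), "medicine": (500, 0.20), "classics": (200, 0.05)}

# Micro-average pools all questions; macro-average weights domains equally.
micro = (sum(n * acc for n, acc in domains.values())
         / sum(n for n, _ in domains.values()))
macro = sum(acc for _, acc in domains.values()) / len(domains)
print(f"micro-average: {micro:.3f}, macro-average: {macro:.3f}")
# Either single number hides the 0.05-0.45 spread across domains.
```

The point is that a one-point headline move can come entirely from a handful of questions in one narrow domain.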

Modern relevance. HLE has rapidly become the headline frontier-knowledge benchmark of 2025–2026, cited on every major model launch alongside GPQA, AIME, FrontierMath, and SWE-Bench Verified.

Reference: CAIS and Scale AI, "Humanity's Last Exam", arXiv 2025; agi.safe.ai.

Related terms: MMLU-Pro, GPQA, FrontierMath, OpenAI o3

This site is currently in Beta. Contact: Chris Paton
