GPQA (Graduate-level Google-Proof Q&A) was introduced by Rein and colleagues in late 2023 as a deliberately difficult science benchmark. Each question is written by a domain expert holding a PhD in physics, chemistry, or biology, and is then validated by a second PhD-holding expert in the same field. To earn the "Google-proof" name, every question is also given to non-expert validators who hold PhDs in other domains and have unrestricted access to the internet; only items that these highly skilled but out-of-field humans fail to solve after roughly 30 minutes of searching are kept.
The full set contains 448 questions. The most frequently cited slice is GPQA Diamond: the 198 hardest items, on which both expert validators agreed the question was unambiguous and high quality. Questions are multiple-choice (typically four options), and the reported metric is accuracy, usually on the Diamond subset. Domain experts working in their own field score around 65%; non-experts with web access score around 34%.
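Since the reported metric is plain accuracy over multiple-choice items, the scoring itself is trivial. The sketch below shows the calculation in Python; the field names ("answer_index") and the toy data are illustrative assumptions, not the official GPQA dataset schema or evaluation harness.

```python
# Minimal sketch of GPQA-style scoring: accuracy over multiple-choice items.
# The "answer_index" field and the toy data below are assumptions for
# illustration, not the official dataset schema.

def score_accuracy(items, predictions):
    """Fraction of multiple-choice items answered correctly.

    items:       list of dicts, each with an "answer_index" (0-3)
    predictions: list of predicted option indices, same length and order
    """
    correct = sum(
        1 for item, pred in zip(items, predictions)
        if pred == item["answer_index"]
    )
    return correct / len(items)

# Example: 120 correct out of 198 Diamond-sized items -> ~60.6% accuracy.
items = [{"answer_index": i % 4} for i in range(198)]
preds = [item["answer_index"] if i < 120 else (item["answer_index"] + 1) % 4
         for i, item in enumerate(items)]
print(f"{score_accuracy(items, preds):.1%}")
```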
Performance trajectory. GPT-4 scored roughly 39% on Diamond at release in 2023, barely above the search-equipped non-expert baseline. Claude 3 Opus reached 50.4% in early 2024, and Claude 3.5 Sonnet reached 59.4% in June 2024. The first reasoning models pushed the ceiling sharply: OpenAI o1 reached 78%, o3 scored 87.7% (announced December 2024), and DeepSeek-R1 reached 71.5%. By late 2025, o3-pro, Claude 4.5 Sonnet, Gemini 2.5 Pro, and Grok 4 all cluster between 85% and 92%, surpassing the in-field expert baseline of 65%.
Known issues. GPQA is small (198 Diamond items), so a single misgraded question shifts the score by 0.5 percentage points. The multiple-choice format means a model that "knows it doesn't know" still has a 25% floor. Several questions have been spotted on the open web after release, raising contamination concerns for any model trained after early 2024.
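The sensitivities mentioned above follow from simple arithmetic on the 198-item size and the four-option format. A back-of-envelope check (in Python; the 87% reference score is an illustrative assumption):

```python
# Back-of-envelope check of the figures in the "Known issues" paragraph.
import math

n_diamond = 198   # GPQA Diamond item count
n_options = 4     # typical number of answer choices

# One misgraded question moves the accuracy by 1/198 of the full scale.
per_question_shift = 100 / n_diamond
print(f"one question = {per_question_shift:.2f} percentage points")  # ~0.51

# Uniform random guessing over four options gives the 25% floor.
guess_floor = 100 / n_options
print(f"random-guess floor = {guess_floor:.0f}%")

# Binomial standard error at an assumed 87% score shows how noisy 198 items are.
p = 0.87
std_err = math.sqrt(p * (1 - p) / n_diamond)
print(f"std. error at 87%: +/-{100 * std_err:.1f} points")  # ~2.4
```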
Modern relevance. GPQA Diamond is the headline science-reasoning benchmark on every 2025 model launch. It displaced MMLU as the flagship "hard knowledge" number for reasoning models because it cannot be solved by retrieval alone: questions require multi-step technical inference.
Reference: Rein et al., "GPQA: A Graduate-Level Google-Proof Q&A Benchmark", COLM 2024.
Related terms: MMLU, OpenAI o3, Claude 4 Family, o1 / Reasoning Models, FrontierMath
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics