David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, & Samuel R. Bowman (2023)
GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022.
URL: https://arxiv.org/abs/2311.12022
Abstract. A graduate-level multiple-choice benchmark in physics, chemistry, and biology, designed to be "Google-proof": non-expert humans with full internet access score around 34%, while PhDs in the corresponding domain score around 65%. The 448 questions were written by domain experts and validated by independent experts. GPQA replaced earlier, saturated benchmarks (MMLU, ARC) as the standard hard-knowledge benchmark for frontier LLM evaluation; frontier models in early 2026 reach the high 70s to low 80s.
Tags: benchmark reasoning language-models