David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, & Samuel R. Bowman (2023)
GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022.
URL: https://arxiv.org/abs/2311.12022
Abstract. A graduate-level multiple-choice benchmark in physics, chemistry, and biology, designed to be "Google-proof": non-expert humans with full internet access score around 34%, while PhDs in the corresponding domain score around 65%. The 448 questions were written by domain experts and validated by independent experts. GPQA replaced earlier, saturated benchmarks (MMLU, ARC) as the standard hard-knowledge benchmark for frontier LLM evaluation; frontier models in early 2026 reach the high 70s to low 80s.
Tags: benchmark reasoning language-models