References

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Jaime Sevilla, Tom Tseng, Yuki Hayashi, Maxim Kapur, Pieter Garrelfs, Carolyn Ashbaugh, et al. (2024)

arXiv:2411.04872.

URL: https://arxiv.org/abs/2411.04872

Abstract. Epoch AI's research-mathematics benchmark. It comprises hundreds of original problems written by mathematicians at research institutions, designed to require minutes to hours of expert effort and to admit only automatically verifiable numerical or symbolic answers. The problems span number theory, combinatorics, analysis, and algebraic geometry. At launch (late 2024), frontier models scored under 2%; by April 2026, the strongest reasoning models with extensive thinking budgets reach 25-35%. FrontierMath has become the canonical hard-mathematics evaluation, replacing saturated benchmarks such as MATH and GSM8K.

Tags: benchmark reasoning mathematics
