References

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Jaime Sevilla, Tom Tseng, Yuki Hayashi, Maxim Kapur, Pieter Garrelfs, Carolyn Ashbaugh, et al. (2024)

arXiv:2411.04872.

URL: https://arxiv.org/abs/2411.04872

Abstract. Epoch AI's research-mathematics benchmark. It comprises hundreds of original problems written by mathematicians at research institutions, designed to require minutes to hours of expert effort and to admit only automatically verifiable numerical or symbolic answers. The problems span number theory, combinatorics, analysis, and algebraic geometry. At launch (late 2024), frontier models scored under 2%; by April 2026, the strongest reasoning models with extensive thinking budgets reach 25-35%. FrontierMath has become the canonical hard-mathematics evaluation, replacing saturated benchmarks such as MATH and GSM8K.

Tags: benchmark reasoning mathematics
