Glossary

GSM8K

GSM8K (Grade School Math 8K), released by Cobbe and colleagues at OpenAI in 2021, is a collection of 8,500 high-quality multi-step arithmetic word problems at roughly the level of a Year 5–8 (US grades 5–8) student. Each problem is stated in natural language, requires between 2 and 8 elementary arithmetic operations to solve, and ships with a fully written-out chain-of-thought solution ending in a numeric answer flagged by ####.
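For readers who want to poke at the data, a minimal loading sketch in Python (assuming the Hugging Face `datasets` package, which hosts the dataset under the id `gsm8k`; the field names below are as published on the Hub):

```python
# Minimal sketch: load GSM8K with the Hugging Face `datasets` package
# and pull apart one record.
from datasets import load_dataset

ds = load_dataset("gsm8k", "main")   # splits: train (7,473) and test (1,319)
example = ds["test"][0]

question = example["question"]       # the word problem, in natural language
solution = example["answer"]         # worked chain-of-thought ending in "#### <number>"

gold = solution.split("####")[-1].strip()   # numeric gold answer after the delimiter
print(question)
print(gold)
```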

The dataset is split into 7,473 training examples and 1,319 test examples. Standard evaluation extracts the final number from the model's generation and compares it (with appropriate normalisation for fractions, units, and trailing zeros) to the gold answer. Scoring is exact-match accuracy, sometimes reported alongside a majority-vote (self-consistency) score, maj@n, taken over n sampled generations; this is distinct from pass@n, which credits a problem if any of the n samples is correct.
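For concreteness, a sketch of this scoring recipe (illustrative only: the regex and normalisation rules here are simplified assumptions, not the official evaluation script):

```python
# Illustrative scorer: extract the last number in a generation, normalise it,
# and score exact match; maj@n is a majority vote over n samples.
import re
from collections import Counter

_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_answer(text: str) -> str | None:
    """Return the last number in the text, commas and trailing zeros stripped."""
    matches = _NUM.findall(text)
    if not matches:
        return None
    value = matches[-1].replace(",", "")
    # Normalise "42.0" and "42" to the same string.
    return str(float(value)).rstrip("0").rstrip(".") if "." in value else value

def exact_match(generation: str, gold: str) -> bool:
    return extract_answer(generation) == extract_answer(gold)

def majority_vote(generations: list[str], gold: str) -> bool:
    """Self-consistency: score the most common extracted answer across n samples."""
    answers = [a for a in map(extract_answer, generations) if a is not None]
    if not answers:
        return False
    top, _ = Counter(answers).most_common(1)[0]
    return top == extract_answer(gold)
```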

Performance trajectory. GSM8K is one of the few benchmarks whose history maps neatly onto the rise of chain-of-thought prompting. In the 2021 release paper, GPT-3 finetuned with trained verifiers reached around 55%, and a 6B verifier model outperformed a straightforwardly finetuned 175B model. PaLM 540B with chain-of-thought prompting hit roughly 58% in 2022. GPT-4 crossed 92% in 2023 with CoT. By 2024, Claude 3.5 Sonnet, Llama 3.1 405B, Qwen 2.5 72B, and GPT-4o all reported scores in the 94–96% range, and OpenAI o1 and DeepSeek-R1 sit at 96–97%. The benchmark is now considered fully saturated; remaining errors largely involve label noise or genuinely ambiguous questions.

Known issues. Multiple studies have shown that rephrasing GSM8K problems (e.g. swapping names, units, or numerical values) drops accuracy by 5–15 points on smaller models, suggesting some answers come from memorisation rather than reasoning. The dataset has appeared verbatim in many web-scraped training corpora since 2021. Contamination-resistant variants followed from other groups: GSM-Symbolic (from Apple researchers) and GSM-Plus both regenerate problems by perturbing templates derived from the originals, as in the sketch below.
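To make the perturbation idea concrete, a toy sketch (the template, names, and numbers here are invented for illustration; the real variants build templates from actual GSM8K problems):

```python
# Toy illustration (not any study's actual protocol): regenerate a problem
# from a template with fresh names and numbers, keeping the gold answer
# derivable from the new values.
import random

TEMPLATE = ("{name} buys {n} pencils at ${price} each. "
            "How much does {name} spend in total?")

def perturbed_instance(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Maya", "Tom", "Priya", "Leo"])   # invented names
    n = rng.randint(2, 9)
    price = rng.randint(1, 5)
    question = TEMPLATE.format(name=name, n=n, price=price)
    answer = n * price           # gold answer tracks the perturbed numbers
    return question, answer

q, a = perturbed_instance(random.Random(0))
```

A model that reasons should be unaffected by such surface changes; a model that memorised the original instance will not be.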

Modern relevance. GSM8K is largely obsolete as a frontier benchmark; it now serves as a smoke test for new small models (3B–8B) and as a quick check that fine-tuning hasn't broken basic arithmetic. Frontier mathematical reasoning is instead reported on MATH, AIME, and FrontierMath.

Reference: Cobbe, K., et al. "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168, 2021.

Related terms: MATH, AIME, Chain-of-Thought, FrontierMath
