ARC-AGI (the Abstraction and Reasoning Corpus) was introduced by François Chollet in his 2019 paper On the Measure of Intelligence. Each task consists of a handful of input/output grid pairs (typically three or four) that demonstrate a transformation rule, plus a held-out test input on which the solver must produce the correct output grid. Grids are coloured (each cell takes one of ten colours), vary in size from 1×1 to 30×30, and the rules involve concepts such as symmetry, gravity, object counting, recolouring, completion, copying, and connectivity. Each task is novel: the rule cannot be looked up, and there is no opportunity to memorise.
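The public tasks ship as one JSON file per task, with "train" and "test" lists of input/output grid pairs and integer cells 0–9 for the ten colours. A minimal loading sketch in Python (the filename is an illustrative task ID, not a required one):

    import json

    # Load one public ARC task. Each grid is a list of rows;
    # each cell is an integer 0-9 naming one of the ten colours.
    with open("0a938d79.json") as f:  # illustrative task file
        task = json.load(f)

    for pair in task["train"]:  # the demonstration pairs
        inp, out = pair["input"], pair["output"]
        print(f"train: {len(inp)}x{len(inp[0])} -> {len(out)}x{len(out[0])}")

    test_input = task["test"][0]["input"]  # the held-out input to solve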
Crucially, ARC-AGI was designed as an anti-pretraining benchmark: it tests fluid intelligence, the ability to acquire a new skill from a handful of examples, rather than crystallised knowledge. Humans solve roughly 80% of public-set tasks; the best deep-learning system at the benchmark's 2019 release scored under 5%. Tasks are graded all-or-nothing: every cell of every output grid must match exactly.
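Concretely, grading reduces to an exact grid comparison; a minimal sketch (the function name is ours, not an official API):

    def arc_score(predicted: list[list[int]], target: list[list[int]]) -> int:
        """All-or-nothing grading: 1 only if the shape and every cell match."""
        if len(predicted) != len(target):  # row counts must match
            return 0
        for pred_row, target_row in zip(predicted, target):
            if pred_row != target_row:  # catches both length and cell mismatches
                return 0
        return 1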
The benchmark is split into a public training set (400 tasks), a public evaluation set (400 tasks), and a private evaluation set (100 secret tasks held by the ARC Prize Foundation). Scores on the private set are the headline numbers reported on the ARC Prize leaderboard.
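For reference, a sketch of enumerating the two public splits, assuming the one-JSON-file-per-task layout of the public GitHub repository (data/training and data/evaluation); the private set is, by design, never distributed:

    from pathlib import Path

    # Count the tasks in each public split; directory layout assumed
    # from the public repo, one JSON file per task.
    for split in ("training", "evaluation"):
        tasks = sorted((Path("data") / split).glob("*.json"))
        print(f"{split}: {len(tasks)} tasks")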
Performance trajectory. For five years (2019–2024) the benchmark resisted neural approaches: the best published methods plateaued around 20–30%, relying on program synthesis and DSL search rather than end-to-end neural networks. Jack Cole and Mohamed Osman's neuro-symbolic hybrid reached 34% in early 2024, the state of the art going into that year's ARC Prize. The breakthrough came in December 2024: OpenAI o3 scored 75.7% on the semi-private evaluation set at low compute and 87.5% at high compute (estimated at thousands of dollars of inference per task), the first system to clear the 85% human-level threshold Chollet had specified. o3's success came at extreme inference cost and was demonstrated on the v1 benchmark; ARC-AGI-2, released in March 2025 with harder tasks specifically designed to resist o3-style brute-force search, currently holds frontier systems below 20%.
Known issues. o3's high-compute solution required inference budgets (estimated at $1k–$10k per task) that no production system could sustain. Chollet himself emphasised that o3 had not "solved" intelligence; rather, it had cracked the v1 benchmark, which motivated the v2 redesign. The all-or-nothing grading is also unusually harsh: a one-pixel error scores zero.
Modern relevance. ARC-AGI remains among the most widely cited fluid-intelligence benchmarks. The 2025 ARC Prize ($1 million purse) drove substantial public interest in compute-efficient reasoning architectures.
Reference: Chollet, F., "On the Measure of Intelligence," arXiv:1911.01547, 2019; ARC Prize Foundation, https://arcprize.org.
Related terms: OpenAI o3, o1 / Reasoning Models, Chain-of-Thought
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics