The AI2 Reasoning Challenge (ARC), introduced by Clark and colleagues at the Allen Institute for AI in 2018, is a multiple-choice grade-school science benchmark drawn from US standardized tests for grades 3 through 9. The dataset contains 7,787 questions, partitioned into:
- ARC-Easy (5,197 questions): items answered correctly by at least one of two simple baselines, an information-retrieval solver and a word co-occurrence solver.
- ARC-Challenge (2,590 questions): items that defeated both baselines; these questions tend to require multi-step reasoning, qualitative physics, or causal inference rather than pure pattern matching. (A loading sketch follows this list.)
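A minimal loading sketch, assuming the Hugging Face `datasets` copy published under `allenai/ai2_arc` (the dataset id and field names below describe that mirror, not the original release files):

```python
from datasets import load_dataset

# Assumed Hugging Face mirror of ARC; the config name selects the split family.
easy = load_dataset("allenai/ai2_arc", "ARC-Easy")
challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge")

ex = challenge["test"][0]
print(ex["question"])         # question stem
print(ex["choices"]["text"])  # answer options (usually four)
print(ex["answerKey"])        # gold label, e.g. "B"
```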
Most questions have four answer choices (a small fraction have three or five). Scoring is plain accuracy; the original evaluation additionally awards 1/k partial credit when a system reports a k-way tie that includes the correct answer. ARC-Challenge is the discriminative split; ARC-Easy saturated quickly.
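As a concrete illustration, here is a minimal scorer implementing that rule (the helper name `arc_score` is hypothetical; modern leaderboards typically just take the argmax and report plain accuracy):

```python
def arc_score(predictions, gold_keys):
    """Mean ARC score: 1 point for the correct answer, 1/k partial
    credit when the system reports a k-way tie containing the gold key."""
    total = 0.0
    for chosen, gold in zip(predictions, gold_keys):
        chosen = set(chosen)  # labels the system committed to
        if gold in chosen:
            total += 1.0 / len(chosen)
    return total / len(gold_keys)

# One confident hit, one two-way tie containing the gold key, one miss:
print(arc_score([{"B"}, {"A", "C"}, {"D"}], ["B", "A", "B"]))  # -> 0.5
```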
Performance trajectory. At release in 2018, the best system scored around 27% on ARC-Challenge, barely above the 25% random baseline. The first major jump came from retrieval-augmented BERT systems (40–55%) in 2019–2020. GPT-3 (175B) reached 51.4% zero-shot. GPT-4 crossed 96% at release. By 2024, frontier models (Claude 3 Opus, Llama 3.1 405B, GPT-4o, Gemini 1.5 Pro) cluster at 96–97%, and ARC-Easy scores are universally above 98%.
Known issues. Like HellaSwag and MMLU, ARC has been on the open web since 2018 and contamination is essentially guaranteed for modern pretraining corpora. The benchmark also conflates two skills: factual recall (which dominates ARC-Easy) and qualitative scientific reasoning (which dominates ARC-Challenge). Some critics argue several "Challenge" items have multiple defensible answers and should be removed.
Modern relevance. ARC is now a low-cost smoke test for new small models, similar to HellaSwag and WinoGrande, and routinely appears in the standard "open LLM leaderboard" tuple. Note that ARC-AGI (a completely different benchmark by François Chollet) shares only the acronym; the two should not be confused. ARC's lasting historical contribution was demonstrating that retrieval-only baselines could solve "easy" science questions while genuinely reasoning-dependent items resisted shallow systems for years, setting the template for the easy/challenge split that recurs in MMLU, GPQA, and SWE-Bench.
Reference: Clark et al., "Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge", arXiv:1803.05457, 2018.
Related terms: HellaSwag, WinoGrande, ARC-AGI, MMLU
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics