DROP (Discrete Reasoning Over Paragraphs), introduced by Dua and colleagues at AI2 in 2019, is a reading-comprehension benchmark that deliberately requires arithmetic, counting, sorting, and date manipulation rather than pure span extraction. Each item presents a short paragraph (typically a Wikipedia excerpt about an NFL game, historical event, or census record) plus a question whose answer requires combining multiple facts from the paragraph.
Example: "How many years passed between the first and last battle?", the model must locate two dates in the paragraph and subtract. Other questions require counting events of a particular type, ranking entities by some attribute, or extracting and combining multiple spans.
The dataset contains 96,567 questions over 6,735 Wikipedia paragraphs, split into train/dev/test. Answers come in three types: numbers, dates, and spans (one or more text spans from the paragraph). Scoring uses Exact Match (EM) and a numeric/span-aware F1 that handles the answer types appropriately.
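The official evaluation script normalises answers, aligns multiple predicted spans against multiple gold spans, and macro-averages over questions. The following is only a minimal, single-span sketch of the core idea behind the numeric-aware F1; the function names and simplifications are illustrative, not the official grader's logic:

```python
import re
import string

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, split into tokens (simplified)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def is_number(token: str) -> bool:
    return bool(re.fullmatch(r"-?\d+(\.\d+)?", token))

def drop_style_f1(prediction: str, gold: str) -> float:
    """Simplified single-span DROP-style F1: bag-of-token overlap,
    with numbers scored strictly (a numeric mismatch zeroes the score)."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    gold_numbers = {t for t in gold_tokens if is_number(t)}
    pred_numbers = {t for t in pred_tokens if is_number(t)}
    if gold_numbers and gold_numbers != pred_numbers:
        return 0.0
    common = sum(min(pred_tokens.count(t), gold_tokens.count(t))
                 for t in set(gold_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(drop_style_f1("3 years", "3"))  # ~0.67: number matches, extra "years" costs precision
print(drop_style_f1("three", "3"))    # 0.0: word-form numbers are not reconciled
```

Even this simplified version reproduces the unit and word-form quirks discussed under Known issues below.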
Performance trajectory. The original 2019 baselines (BiDAF, BERT) scored roughly 30–40% F1, far below the reported human upper bound of about 96% F1. Specialised systems (NABERT, NumNet+) climbed past 80% F1 by 2021. GPT-3 (175B) few-shot reached 70.4% F1. GPT-4 reported 80.9% F1 (3-shot) at release. Claude 3.5 Sonnet reported 87.1% F1. Frontier 2025 models (o1, o3, Claude Opus 4, Gemini 2.5 Pro) cluster at 90–93% F1, approaching but not quite matching the human ceiling.
Known issues. DROP's automatic grading struggles with answer normalisation: numeric answers can be expressed as digits ("3"), words ("three"), or with units ("3 years"), and the F1 grader has to reconcile these, sometimes incorrectly. Dataset analyses have also shown a disproportionate share of NFL-game paragraphs, which biases the benchmark toward sports trivia rather than general reasoning. As with all 2019-era benchmarks, contamination of modern models' training data is presumed.
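One way a grader can reconcile these surface forms is to canonicalise answers before comparison. The sketch below shows the idea; the word list and unit list are invented for illustration and not taken from any particular evaluation harness:

```python
# Hypothetical answer normaliser: maps "three", "3", and "3 years"
# to the same canonical string before scoring.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}
UNIT_WORDS = {"year", "years", "yard", "yards", "point", "points", "percent"}

def normalize_numeric_answer(answer: str) -> str:
    tokens = answer.lower().replace(",", "").split()
    tokens = [WORD_TO_DIGIT.get(t, t) for t in tokens]   # word forms -> digits
    tokens = [t for t in tokens if t not in UNIT_WORDS]  # drop unit words
    return " ".join(tokens)

for raw in ["3", "three", "3 years", "Three years"]:
    print(normalize_numeric_answer(raw))  # all print "3"
```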
Modern relevance. DROP is no longer a frontier benchmark but is still reported on most general-capability model cards. It pioneered the category of discrete reasoning over text that later motivated math-reasoning benchmarks such as GSM8K and MATH.
Reference: Dua et al., "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs", NAACL 2019.
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics