WinoGrande, introduced by Sakaguchi and colleagues in 2020, is a large-scale, 44,000-item extension of the classic Winograd Schema Challenge (WSC) for commonsense pronoun resolution. Each item presents a short sentence with a pronoun and two candidate antecedents; resolving the pronoun requires world knowledge or commonsense inference that goes beyond syntax.
A canonical example: "The trophy doesn't fit in the brown suitcase because it is too small." Does it refer to the trophy or the suitcase? Swapping small for large flips the correct answer; these twin sentence pairs are the WSC's key trick, and WinoGrande preserves the structure at scale. The dataset was filtered with AFLite, an adversarial filtering algorithm in the same family as the one used for HellaSwag, to remove items solvable by surface heuristics such as word co-occurrence.
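The two-option format above can be sketched as data plus a scoring loop. This is an illustrative sketch, not the official evaluation harness: `pick_option` and `toy_score` are hypothetical names, and a real evaluator would score each filled-in sentence with a language model's log-probability rather than the toy function shown here.

```python
# Sketch of WinoGrande-style item scoring (names are illustrative, not from
# any official harness). Each item is a sentence with a blank "_" and two
# candidate fillers; the model keeps the filler it finds more likely.

def pick_option(sentence: str, option1: str, option2: str, score_fn) -> str:
    """Fill the blank with each option and return the higher-scoring one."""
    s1 = sentence.replace("_", option1)
    s2 = sentence.replace("_", option2)
    return option1 if score_fn(s1) >= score_fn(s2) else option2

# Twin pair: swapping the trigger word (small/large) flips the gold answer.
twin_pair = [
    ("The trophy doesn't fit in the brown suitcase because _ is too small.",
     "the trophy", "the suitcase", "the suitcase"),
    ("The trophy doesn't fit in the brown suitcase because _ is too large.",
     "the trophy", "the suitcase", "the trophy"),
]

# Toy stand-in for a real LM scorer (hypothetical; do not expect it to be
# right — the point is the interface, not the answers).
def toy_score(text: str) -> float:
    return -len(text)

for sentence, o1, o2, gold in twin_pair:
    pred = pick_option(sentence, o1, o2, toy_score)
    print(sentence, "->", pred, "(gold:", gold + ")")
```

The twin structure means a heuristic that keys on surface features alone cannot get both members of a pair right, which is exactly what the adversarial filtering step enforces at dataset-construction time.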
Splits include WinoGrande-XS, S, M, L, XL (training set sizes from 160 to 40,398) plus a 1,267-item validation set and a 1,767-item test set. Scoring is binary accuracy on the test set, reported on the AI2 leaderboard. Random baseline is 50%; human performance is 94%.
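Binary accuracy over the test set reduces to a mean over item-level correctness. A minimal sketch follows; the test-set size is from the text, while the prediction vector is a toy illustration, not real model output:

```python
def winogrande_accuracy(predictions, golds):
    """Binary accuracy: fraction of two-way items answered correctly."""
    assert len(predictions) == len(golds)
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)

# Toy illustration on a test-set-sized vector (1,767 items): a model that
# gets 1,590 of them right scores about 90%.
n_test = 1767
golds = ["option1"] * n_test
preds = ["option1"] * 1590 + ["option2"] * (n_test - 1590)
print(f"{winogrande_accuracy(preds, golds):.1%}")  # prints 90.0%
```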
Performance trajectory. GPT-2 large scored around 66% in 2020. GPT-3 (175B) zero-shot reached 70.2%. PaLM 540B crossed 85% in 2022. GPT-4 reached 87.5% at release. Llama 3.1 405B and Claude 3.5 Sonnet report 88–89%. The benchmark has not quite saturated to the human ceiling (frontier 2025 models cluster at 89–92%), but it is no longer a discriminative signal.
Known issues. WinoGrande has been on the open web since 2020. Several papers have shown that some twin pairs are not perfectly counterfactual: the easy member of a pair can leak information about the hard member through training co-occurrence. The test set is also small (1,767 items) and answers are binary, so a single flipped question moves accuracy by about 0.06 percentage points.
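The ~0.06 pp per-question figure follows directly from the test-set size; a quick check, with a normal-approximation 95% confidence interval added as an illustration (the interval is a computed aside, not a number from the text):

```python
import math

n = 1767                 # WinoGrande test-set size
swing = 100 / n          # percentage points moved per flipped answer
print(f"per-item swing: {swing:.3f} pp")  # prints 0.057 pp

# Illustrative normal-approximation 95% CI half-width at 90% accuracy:
# roughly +/- 1.4 pp, which is why small score gaps between frontier
# models on this benchmark are hard to distinguish from noise.
p = 0.90
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"95% CI half-width: {half_width * 100:.2f} pp")  # prints 1.40 pp
```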
Modern relevance. WinoGrande remains a standard line on small-model release reports (Phi, Gemma, Llama 3.2 1B/3B). At the frontier it is essentially solved.
Reference: Sakaguchi et al., "WinoGrande: An Adversarial Winograd Schema Challenge at Scale", AAAI 2020.
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics