Glossary

SQuAD

SQuAD (Stanford Question Answering Dataset), introduced by Rajpurkar and colleagues in 2016, is the canonical extractive reading-comprehension benchmark that defined the modern QA paradigm. Each item presents a Wikipedia paragraph (~120 words on average) plus a question authored by a crowdworker; the answer is a span of text copied verbatim from the paragraph. The original SQuAD 1.1 release contained 107,785 questions over 536 Wikipedia articles.

SQuAD 2.0, released in 2018, added 53,775 unanswerable questions: questions for which no answer span exists in the paragraph but which superficially resemble answerable ones. Models must learn to abstain ("no answer") rather than confidently extract a wrong span. SQuAD 2.0 became the standard variant from 2018 onward, as SQuAD 1.1 saturated almost immediately.
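The abstention decision is often implemented as a score comparison (this follows the common recipe popularised by BERT-style SQuAD 2.0 systems; the function name, score format, and default threshold below are illustrative assumptions, not part of the dataset or any official API):

```python
def predict_with_abstention(span_scores, null_score, threshold=0.0):
    """Return the best-scoring span, or None ("no answer") if the model's
    null score beats the best span score by more than `threshold`.

    span_scores: dict mapping candidate answer spans to model scores
                 (hypothetical format for illustration).
    null_score:  the model's score for the "no answer" option.
    """
    best_span, best_score = max(span_scores.items(), key=lambda kv: kv[1])
    if null_score - best_score > threshold:
        return None  # abstain: question judged unanswerable
    return best_span
```

The threshold is typically tuned on the development set to trade off wrong-span errors against wrong abstentions.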

Scoring uses Exact Match (EM) and token-level F1 between the predicted span and the gold answer span, after normalisation that lowercases the text and strips punctuation, articles (a/an/the), and extra whitespace. Human performance on SQuAD 1.1 is EM 82.3% / F1 91.2%; on SQuAD 2.0, EM 86.8% / F1 89.5%.
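The metric can be sketched in a few lines of Python (this mirrors the logic of the official SQuAD evaluation script, but is a simplified re-implementation for illustration, not the script itself):

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return normalize_answer(prediction) == normalize_answer(gold)

def f1_score(prediction, gold):
    """Token-overlap F1 between normalised prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "the Norman conquest" against gold "Norman conquest of England" scores EM 0 but F1 2/3, which is why F1 is the more commonly reported number. (The real script also takes the maximum over multiple gold answers per question.)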

Performance trajectory. The 2016 baseline (logistic regression with hand-crafted features) scored EM 40 / F1 51. The first BiDAF and DCN systems crossed F1 80 in 2017. BERT-Large in 2018 reached F1 91.0 on SQuAD 1.1, effectively matching the human ceiling for the first time. SQuAD 2.0's no-answer twist briefly held the line (early BERT scored F1 81) but was surpassed by ALBERT and RoBERTa within a year. By 2020 the SQuAD 2.0 leaderboard's top systems (XLNet, ALBERT-xxlarge, ELECTRA) had all exceeded the human F1 of 89.5%. Modern frontier LLMs (GPT-4, Claude 3.5, Gemini) score F1 92–95% zero-shot, without any task-specific training.

Known issues. SQuAD has been fully saturated since 2019 and is no longer a discriminative benchmark. As one of the most widely distributed NLP datasets in history, it is universally present in pretraining corpora; every modern LLM has effectively memorised large parts of it. The extractive-span format also limits questions to those whose answers literally appear in the paragraph, which excludes most interesting reasoning.

Modern relevance. SQuAD's primary legacy is historical and architectural: it shaped the encoder-only transformer era (BERT, RoBERTa, ALBERT, ELECTRA), drove the development of attention-pooling architectures, and remains a teaching benchmark for NLP courses worldwide. It is rarely reported on modern LLM model cards.

Reference: Rajpurkar et al., "SQuAD: 100,000+ Questions for Machine Comprehension of Text", EMNLP 2016; Rajpurkar et al., "Know What You Don't Know: Unanswerable Questions for SQuAD", ACL 2018.

Related terms: DROP, GLUE and SuperGLUE, F1 Score
