HellaSwag, introduced by Zellers and colleagues in 2019, is a multiple-choice commonsense sentence-completion benchmark. Each item presents a short context (a sentence or two from a video caption or how-to article) followed by four candidate continuations; the model must pick the most plausible one. The trick is that the three distractor continuations were generated by a language model and then filtered by Adversarial Filtering: only continuations that fool a BERT-style discriminator but read as obviously wrong to humans are retained. The result is a benchmark on which humans score 95%+ while the best models of 2019 scored below 50%.
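The filtering loop is easy to state even though the paper's full pipeline is more involved. The sketch below is a schematic rendering, not the authors' code; `generate` and `train_discriminator` are stand-ins for the paper's generator LM and BERT-style discriminators:

```python
def adversarial_filtering(contexts, true_endings, generate, train_discriminator,
                          rounds=10, n_distractors=3):
    """Schematic Adversarial Filtering (after Zellers et al., 2019).

    `generate(ctx)` samples one machine-written ending (assumed given);
    `train_discriminator(examples)` fits a real-vs-generated classifier and
    returns a scoring function (assumed given). Only distractors that still
    fool the current discriminator survive each round.
    """
    distractors = {c: [generate(c) for _ in range(n_distractors)] for c in contexts}
    for _ in range(rounds):
        # Retrain a fresh discriminator on the current real/fake pairs.
        examples = [(c, e, 1) for c, e in zip(contexts, true_endings)]
        examples += [(c, d, 0) for c in contexts for d in distractors[c]]
        score = train_discriminator(examples)  # higher = "looks human-written"
        # Replace distractors the discriminator catches; keep the ones it misses.
        for c in contexts:
            distractors[c] = [d if score(c, d) > 0.5 else generate(c)
                              for d in distractors[c]]
    return distractors
```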
The dataset contains around 70,000 items sourced from ActivityNet captions and WikiHow articles. Train and validation splits ship with labels; test-set labels are withheld and scored via a public leaderboard. The metric is simple accuracy on the four-way multiple choice: random baseline 25%, human average 95.6%.
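In practice, most reported numbers come from likelihood scoring rather than prompting: each ending is appended to the context, the model's log-probability of the ending tokens is summed (often length-normalized, as in lm-evaluation-harness's acc_norm), and the highest-scoring ending is the prediction. A minimal sketch with a Hugging Face causal LM; the ctx/endings/label field names follow the public dataset release, and gpt2 is just a stand-in model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(ctx: str, ending: str) -> float:
    """Sum of log-probs the model assigns to `ending` given `ctx`."""
    # Assumes the context tokenization is a prefix of the full tokenization,
    # which generally holds for GPT-2-style BPE when a space is prepended.
    n_ctx = tok(ctx, return_tensors="pt").input_ids.shape[1]
    full = tok(ctx + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    logprobs = logits[:, :-1].log_softmax(-1)         # predict token t+1 from t
    picked = logprobs.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return picked[0, n_ctx - 1:].sum().item()         # score only ending tokens

def predict(item: dict) -> int:
    """Index of the ending with the highest length-normalized log-likelihood."""
    scores = [ending_logprob(item["ctx"], e) / max(len(e), 1)
              for e in item["endings"]]
    return max(range(len(scores)), key=scores.__getitem__)
```

Accuracy is then the fraction of items where the argmax matches int(item["label"]). Note that the harness's acc_norm normalizes by byte length rather than character length, which shifts results slightly across tokenizers.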
Performance trajectory. When HellaSwag launched, BERT-Large scored 47.3% against the human 95.6%, a 48-point gap that motivated the paper's title, "Can a Machine Really Finish Your Sentence?". GPT-2 (1.5B) reached roughly 41%; GPT-3 (175B, zero-shot) 78.9%. PaLM 540B crossed 83% in 2022. GPT-4 reported 95.3% (10-shot) in 2023. By 2024, all frontier models (GPT-4o, Claude 3.5, Gemini 1.5, Llama 3.1 405B) report 95–96%, matching human performance. The benchmark is now considered fully saturated.
Known issues. HellaSwag has been in pretraining corpora since 2019 and shows clear signs of contamination. The four-way multiple-choice format is also vulnerable to the now-classic "letter bias": when the endings are presented as labeled options in the prompt, models can over-prefer option A or D depending on prompting style. As an evaluation benchmark HellaSwag is dead; as a training signal (especially for small models) it is still useful for measuring early commonsense acquisition.
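A cheap sanity check, given any predict(item) -> index function (likelihood-based like the sketch above, or a prompted A–D chooser): rotate the endings and test whether the prediction follows the content or the slot. The helpers below are hypothetical, not a standard routine:

```python
def rotated(item: dict, k: int) -> dict:
    """Copy of `item` with endings cyclically shifted left by k slots."""
    e = item["endings"]
    return {**item, "endings": e[k:] + e[:k],
            "label": str((int(item["label"]) - k) % 4)}

def position_consistency(items, predict) -> float:
    """Fraction of items predicted as the *same content* under all four
    rotations. A position-biased model scores low here even if its raw
    accuracy looks respectable."""
    ok = 0
    for item in items:
        # Map each rotated prediction back to the original ending index.
        preds = {(predict(rotated(item, k)) + k) % 4 for k in range(4)}
        ok += len(preds) == 1
    return ok / len(items)
```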
Modern relevance. HellaSwag is reported almost universally on small-model releases (Phi, Gemma, Llama 3.2 1B/3B, Mistral 7B) as one of seven or eight standard "broad capability" benchmarks. It is no longer a frontier signal.
Reference: Zellers et al., "HellaSwag: Can a Machine Really Finish Your Sentence?", ACL 2019.
Related terms: WinoGrande, ARC, MMLU
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics