GLUE (General Language Understanding Evaluation), introduced by Wang and colleagues in 2018, is a suite of nine NLP tasks assembled to provide a single standardised report card for natural-language-understanding systems. The tasks span sentiment (SST-2), paraphrase detection (MRPC, QQP), textual entailment (MNLI, RTE, WNLI), semantic similarity (STS-B), linguistic acceptability (CoLA), and question–answer entailment (QNLI). Scoring averages the per-task metrics (accuracy, F1, Matthews correlation, Pearson/Spearman correlation) into a single GLUE score out of 100.
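To make the aggregation concrete, the following is a minimal sketch of the macro-averaging step, assuming each task has already been reduced to one number on a 0–100 scale (tasks with two official metrics, such as MRPC or STS-B, are averaged first); the numeric scores below are placeholders, not real leaderboard results.

```python
# Sketch of GLUE-style aggregation: each task contributes one number on a
# 0-100 scale (accuracy, F1, Matthews or Pearson/Spearman correlation,
# depending on the task), and the GLUE score is their unweighted mean.
# All scores below are illustrative placeholders.
per_task_scores = {
    "cola": 60.0,   # Matthews correlation x 100
    "sst2": 94.0,   # accuracy
    "mrpc": 88.0,   # accuracy and F1, averaged
    "stsb": 87.0,   # Pearson and Spearman correlation, averaged
    "qqp":  72.0,   # accuracy and F1, averaged
    "mnli": 86.0,   # accuracy (matched/mismatched averaged)
    "qnli": 92.0,   # accuracy
    "rte":  70.0,   # accuracy
    "wnli": 65.0,   # accuracy
}

glue_score = sum(per_task_scores.values()) / len(per_task_scores)
print(f"GLUE score: {glue_score:.1f}")  # unweighted macro-average across tasks
```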
GLUE saturated within 18 months of release: BERT-Large (2018) hit 80.5 on the leaderboard, and by mid-2019 ensemble systems exceeded the human baseline of 87.1. This rapid saturation prompted the same authors to release SuperGLUE in 2019, a harder eight-task suite (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC) requiring more complex reasoning, with a higher human baseline of 89.8.
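Both suites remain easy to obtain for replication work. The sketch below loads one task from each, assuming the Hugging Face `datasets` library and the hub dataset names `glue` and `super_glue`; exact names and available configurations may vary across library versions.

```python
# Sketch: loading GLUE / SuperGLUE tasks, assuming the Hugging Face
# `datasets` library and the hub dataset names "glue" and "super_glue".
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")          # GLUE sentiment task
boolq = load_dataset("super_glue", "boolq")  # SuperGLUE yes/no question answering

print(sst2["train"][0])   # a single sentence with a binary sentiment label
print(boolq["train"][0])  # a passage, a yes/no question, and a binary label
```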
Performance trajectory. GLUE: BERT-Base (2018) 78.3 → BERT-Large 80.5 → RoBERTa 88.5 → T5-11B 90.3 → DeBERTa-v3 91.1; the human baseline of 87.1 was first exceeded in mid-2019. SuperGLUE: BERT-Large (2019) 69.0 → T5-11B 89.3 → DeBERTa-v3 91.4; the human baseline of 89.8 was crossed in 2021. By 2023, modern LLMs such as GPT-4 and Claude 2 matched or exceeded the SuperGLUE human baseline zero-shot, with no task-specific fine-tuning.
Known issues. Both GLUE and SuperGLUE were designed for the encoder-only fine-tuning paradigm of 2018–2020, not for instruction-tuned generative LLMs. The aggregated score papers over very different per-task difficulties and metric behaviours: CoLA's Matthews correlation is famously noisy, and WSC has only 285 test items. And, like every public NLP benchmark, both suites suffer from endemic contamination, since their test sets have circulated publicly for years.
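To illustrate why the Matthews correlation coefficient is noisy on a small, class-imbalanced test set like CoLA's, the sketch below computes MCC directly from confusion-matrix counts and shows how flipping a handful of predictions shifts the score; the counts are invented for illustration and do not correspond to any real system.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical CoLA-like evaluation: 1,000 items, heavily skewed towards
# the "acceptable" class, so the minority class dominates the score.
print(round(mcc(tp=700, tn=150, fp=100, fn=50), 2))  # 0.58
# Flip just 15 minority-class predictions from correct to incorrect:
print(round(mcc(tp=700, tn=135, fp=115, fn=50), 2))  # 0.53 -- a 0.05 swing
```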
Modern relevance. GLUE and SuperGLUE are now historical artefacts: they defined the BERT era and established the single-suite idea that HELM and BIG-Bench later expanded, but they are essentially never reported on modern LLM model cards. Their lasting contribution is methodological: the practice of bundling NLU tasks into a single comparable score.
Reference: Wang et al., "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding", ICLR 2019; Wang et al., "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems", NeurIPS 2019.
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics