Glossary

BIG-Bench and BBH

BIG-Bench (Beyond the Imitation Game Benchmark), released in 2022 by a collaboration of more than 450 authors across 132 institutions, is a sprawling collection of 204 tasks designed to probe capabilities that single-task benchmarks miss. Tasks include logical deduction, causal reasoning, theory of mind, code reading, multi-step arithmetic, linguistics puzzles, ethics judgement, anachronism detection, and cryptic crossword solving. Two curated subsets are widely used: BIG-bench Lite, a small selection for cheaper evaluation, and BIG-Bench Hard, described below. Scoring depends on task type: exact match, multiple-choice accuracy, BLEU, or human preference, aggregated per task.
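To make the scoring types concrete, here is a minimal sketch of how a single JSON-style BIG-Bench task might be scored. The field names ("input", "target", "target_scores") follow the public task format, but the toy task and model answers are invented for illustration, and a real harness also handles answer normalisation, BLEU, and human-preference tasks that this sketch omits.

```python
# Simplified sketch of a BIG-Bench-style JSON task. Each example pairs an
# "input" with either a free-text "target" (scored by exact match) or a
# "target_scores" dict (scored as multiple choice). Data here is invented.
task = {
    "name": "toy_task",
    "examples": [
        {"input": "2 + 2 = ?", "target": "4"},
        {"input": "Which of these is prime?", "target_scores": {"9": 0, "7": 1}},
    ],
}

def score_example(example, model_answer: str) -> float:
    """Return the chosen option's score for multiple choice, else 1.0/0.0 for exact match."""
    if "target_scores" in example:
        return float(example["target_scores"].get(model_answer, 0))
    return float(model_answer.strip() == example["target"].strip())

# Pretend model outputs, aligned with the examples above.
answers = ["4", "7"]
scores = [score_example(ex, ans) for ex, ans in zip(task["examples"], answers)]
print(f"{task['name']}: accuracy = {sum(scores) / len(scores):.2f}")
```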

BIG-Bench Hard (BBH), introduced by Suzgun and colleagues in late 2022, is a curated subset of 23 tasks on which prior model evaluations, using standard answer-only prompting, had failed to beat the average human rater; the accompanying paper showed that chain-of-thought prompting closes much of that gap. Tasks include date understanding, logical deduction with three to seven objects, multistep arithmetic, geometric shapes, navigation, sports understanding, and tracking shuffled objects. Each task is evaluated with few-shot chain-of-thought prompts, and the headline score is the macro-averaged accuracy across the 23 tasks.
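The macro-average is the unweighted mean of per-task accuracies, not accuracy pooled over all examples, so every task counts equally regardless of size. A brief sketch, with invented per-task accuracies attached to real BBH task names:

```python
from statistics import mean

# Hedged sketch: the accuracies below are invented, though the task names are
# real BBH tasks. The headline BBH score is the unweighted (macro) mean of
# per-task accuracy.
per_task_accuracy = {
    "date_understanding": 0.96,
    "multistep_arithmetic_two": 0.88,
    "tracking_shuffled_objects_seven_objects": 0.71,
    "logical_deduction_seven_objects": 0.74,
    # ...the remaining BBH tasks would be listed here (23 in total).
}

macro_avg = mean(per_task_accuracy.values())
print(f"BBH (macro-average over {len(per_task_accuracy)} tasks): {macro_avg:.3f}")

# Per-task breakdown: a respectable mean can hide much weaker tasks,
# which is why per-task analysis is usually more informative.
for task, acc in sorted(per_task_accuracy.items(), key=lambda kv: kv[1]):
    print(f"  {task:45s} {acc:.2f}")
```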

Performance trajectory. PaLM 540B scored about 65% on BBH with chain-of-thought in 2022, up from roughly 52% with answer-only prompting and approaching the average human-rater baseline of around 68%. GPT-4 reached 83% at release. Claude 3.5 Sonnet crossed 93%. By 2025, frontier models (o3, Claude 4 Opus, Gemini 2.5 Pro) all sit above 94%, and BBH is considered saturated. Several individual sub-tasks (notably "tracking shuffled objects" and "logical deduction" with seven objects) remain stubbornly below 95%; they probe long-context working memory more than reasoning.

Known issues. As with every public benchmark, contamination is endemic: BIG-Bench has been on GitHub since 2022 and appears in countless training corpora. The macro-average across heterogeneous tasks also disguises wide variation: models can score 100% on some tasks and 30% on others while landing on a respectable mean. Per-task analysis is almost always more informative than the headline number.

Modern relevance. BIG-Bench's primary legacy is conceptual: it normalised the idea that frontier models should be tested on a suite of capabilities rather than a single benchmark. Later multi-task evaluation suites (HELM, LiveBench, MMLU-Pro) follow the same broad-suite approach. BBH itself is still reported on most model cards but is no longer discriminative at the frontier.

Reference: Srivastava et al., "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models", TMLR 2023; Suzgun et al., "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them", Findings of ACL 2023.

Related terms: MMLU, HELM, LiveBench, Chain-of-Thought
