Glossary

HELM

HELM (Holistic Evaluation of Language Models), introduced by Liang and colleagues at Stanford CRFM in 2022, is not a single benchmark but a systematic evaluation framework that runs language models across many scenarios while measuring multiple metrics simultaneously. The original release covered 42 scenarios (question answering, summarisation, sentiment, toxicity, information retrieval, language modelling, knowledge, reasoning, harms, and more), of which 16 were designated core scenarios measured against all 7 metric categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

HELM's central methodological contribution is the scenario × metric matrix: every model is run on every scenario, and each core scenario is scored under every metric category rather than reduced to a single headline number. This exposes trade-offs that single-score benchmarks hide: an instruction-tuned model may gain on accuracy but lose on calibration, and scaling can help factual QA while barely moving bias or toxicity.
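
A minimal sketch of the matrix idea in Python, with invented model names, scenarios, and scores purely for illustration (these are not HELM results):

    # Sketch of HELM's scenario x metric matrix. Every model is scored on
    # every scenario under several metric categories at once. All names and
    # numbers below are invented for illustration, not real HELM results.

    # scores[model][scenario][metric] -> value in [0, 1], higher is better
    scores = {
        "model_a": {
            "qa":   {"accuracy": 0.82, "calibration": 0.55},
            "summ": {"accuracy": 0.74, "calibration": 0.60},
        },
        "model_b": {
            "qa":   {"accuracy": 0.78, "calibration": 0.71},
            "summ": {"accuracy": 0.70, "calibration": 0.69},
        },
    }

    def rank_by(metric):
        """Rank models by their mean score on one metric across scenarios."""
        means = {
            model: sum(s[metric] for s in scens.values()) / len(scens)
            for model, scens in scores.items()
        }
        return sorted(means, key=means.get, reverse=True)

    # The ranking flips between metrics, which is the trade-off the matrix exposes.
    print(rank_by("accuracy"))     # ['model_a', 'model_b']
    print(rank_by("calibration"))  # ['model_b', 'model_a']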

HELM is maintained by the Stanford Center for Research on Foundation Models as a continuously updated leaderboard at crfm.stanford.edu/helm. By 2025 the framework had expanded into multiple specialised tracks: HELM Lite (a lighter scenario suite for fast comparisons), HELM Classic (the original 42 scenarios), HELM Instruct (instruction-following), HELM MMLU (re-grading of MMLU with consistent prompting), HELM Safety (safety evaluation), HELM-Code, and HELM-Medical.

Performance trajectory. Because HELM aggregates many scenarios, no single trajectory captures it. The Stanford leaderboard shows the typical pattern: GPT-3-era models clustered around a 0.4 mean win rate, GPT-4 / Claude 2 era models around 0.7, and 2024–2025 frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro) at 0.85 or above across scenarios. Crucially, the rank order of models flips between metrics: a model that wins on accuracy may lose on calibration or fairness.
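
The mean win rate is a pairwise statistic: on each scenario, a model's win rate is the fraction of the other models it outperforms, and the mean averages this across scenarios. A minimal sketch of that computation, with toy scores rather than real leaderboard data:

    # Mean win rate as described in the HELM paper: on each scenario a model's
    # win rate is the fraction of the other models it outperforms; the mean
    # win rate averages this over scenarios. Toy scores only; ties ignored.

    accuracy = {  # accuracy[scenario][model] -> score
        "qa":   {"m1": 0.80, "m2": 0.70, "m3": 0.60},
        "summ": {"m1": 0.50, "m2": 0.65, "m3": 0.55},
    }

    def mean_win_rate(model):
        per_scenario = []
        for results in accuracy.values():
            others = [m for m in results if m != model]
            wins = sum(results[model] > results[m] for m in others)
            per_scenario.append(wins / len(others))
        return sum(per_scenario) / len(per_scenario)

    for m in ("m1", "m2", "m3"):
        print(m, mean_win_rate(m))  # m1 0.5, m2 0.75, m3 0.25

Note how m1, the accuracy leader on "qa", still ends up mid-table once its weak "summ" result is averaged in; this is why a single win-rate number can mask exactly the per-scenario differences HELM measures.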

Known issues. The breadth that makes HELM valuable also makes it expensive to run at the leaderboard's update cadence; some scenarios have lagged frontier-model releases by months. Aggregating very different scenarios into a single "win rate" has also been criticised for masking exactly the trade-offs the framework was designed to expose.

Modern relevance. HELM is the most-cited holistic evaluation framework in academic AI policy and safety work, and is widely used by regulators (UK AI Safety Institute, NIST AISI, EU AI Act drafting committees) as a reference benchmark suite.

Reference: Liang et al., "Holistic Evaluation of Language Models", TMLR 2023.

Related terms: MMLU, BIG-Bench and BBH, Chatbot Arena, LiveBench
