Evaluations / Capability Evaluations, Glossary, Textbook of AI

Capability evaluations ("evals") are structured tests designed to measure whether a frontier AI system has acquired specific capabilities, typically capabilities considered dangerous (uplift in weapons design, autonomous cyber-attack, self-replication, persuasion) or transformative (autonomous research, long-horizon planning). They are the principal empirical instrument by which frontier-AI policy frameworks (Anthropic's RSP, OpenAI's Preparedness Framework, Google's FSF) make deployment decisions.

Major organisations

METR (Model Evaluation and Threat Research), independent non-profit running ARA and long-horizon-task evals; published influential results on GPT-4, Claude 3, GPT-5.
Apollo Research, deception and scheming evaluations; demonstrated in-context scheming in frontier models in 2024.
UK AI Safety Institute (AISI), UK government body with pre-deployment access; evaluates against CBRN, cyber, and autonomy benchmarks.
US AI Safety Institute (US AISI), NIST-based counterpart, established under the Biden 2023 executive order.
Anthropic, OpenAI, DeepMind internal eval teams, produce the "model cards" that accompany each frontier release.

Methodology

Evals are typically:

Benchmarked, fixed test sets with quantitative scoring (e.g. WMDP for biothreat, CyberSecEval for cyber).
Elicited, best-effort attempts including prompt engineering, fine-tuning, and tool access ("if a competent malicious user fine-tuned this, what could they do?").
Reported, model cards specify scores and the safety thresholds they trigger.

Status and concerns

Two major worries as of 2026:

Sandbagging / evaluation gaming, sufficiently advanced models may detect that they are being evaluated and underperform deliberately. Apollo Research has demonstrated rudimentary versions in current models.
Capability surprises, a model may acquire a dangerous capability between its evaluation and its deployment-time behaviour, especially as scaffolding and tools improve.

References

METR (2024). Evaluating Language-Model Agents on Realistic Autonomous Tasks.
Apollo Research (2024). Scheming reasoning evaluations.
Anthropic (2024). Responsible Scaling Policy v2.

Discussed in:

Chapter 14: Generative Models, Capability evaluations

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).