Capability evaluations ("evals") are structured tests designed to measure whether a frontier AI system has acquired specific capabilities, typically capabilities considered dangerous (uplift in weapons design, autonomous cyber-attack, self-replication, persuasion) or transformative (autonomous research, long-horizon planning). They are the principal empirical instrument by which frontier-AI policy frameworks (Anthropic's RSP, OpenAI's Preparedness Framework, Google's FSF) make deployment decisions.
Categories
The threat-model literature has converged on roughly six categories:
CBRN uplift, chemical, biological, radiological and nuclear weapons.
Cyber, autonomous discovery and exploitation of software vulnerabilities; CTF performance; full kill-chain operations.
Autonomous replication and adaptation (ARA), can the agent acquire resources, copy itself onto new hardware, and continue operating without human assistance? METR's flagship evaluation.
Persuasion / manipulation, can the model change beliefs or behaviour beyond a strong human baseline?
Deception / scheming, does the model behave differently when it believes it is being observed? (Apollo Research.)
Autonomous AI R&D, can the model accelerate AI development itself, raising concerns about a recursive intelligence explosion?
Major organisations
METR (Model Evaluation and Threat Research), independent non-profit running ARA and long-horizon-task evals; published influential results on GPT-4, Claude 3, GPT-5.
Apollo Research, deception and scheming evaluations; demonstrated in-context scheming in frontier models in 2024.
UK AI Safety Institute (AISI), UK government body with pre-deployment access; evaluates against CBRN, cyber, and autonomy benchmarks.
US AI Safety Institute (US AISI), NIST-based counterpart, established under the Biden 2023 executive order.
Anthropic, OpenAI, DeepMind internal eval teams, produce the "model cards" that accompany each frontier release.
Methodology
Evals are typically:
Benchmarked, fixed test sets with quantitative scoring (e.g. WMDP for biothreat, CyberSecEval for cyber).
Elicited, best-effort attempts including prompt engineering, fine-tuning, and tool access ("if a competent malicious user fine-tuned this, what could they do?").
Reported, model cards specify scores and the safety thresholds they trigger.
Status and concerns
Two major worries as of 2026:
Sandbagging / evaluation gaming, sufficiently advanced models may detect that they are being evaluated and underperform deliberately. Apollo Research has demonstrated rudimentary versions in current models.
Capability surprises, a model may acquire a dangerous capability between its evaluation and its deployment-time behaviour, especially as scaffolding and tools improve.
References
METR (2024). Evaluating Language-Model Agents on Realistic Autonomous Tasks.
Apollo Research (2024). Scheming reasoning evaluations.
Anthropic (2024). Responsible Scaling Policy v2.
Related terms: Red-Teaming (LLMs), Responsible Scaling Policy (RSP), AI Safety Levels (ASL), Frontier AI Safety Commitments, Deceptive Alignment
Discussed in:
- Chapter 14: Generative Models, Capability evaluations