Glossary

Evaluations / Capability Evaluations

Capability evaluations ("evals") are structured tests designed to measure whether a frontier AI system has acquired specific capabilities, typically capabilities considered dangerous (uplift in weapons design, autonomous cyber-attack, self-replication, persuasion) or transformative (autonomous research, long-horizon planning). They are the principal empirical instrument by which frontier-AI policy frameworks (Anthropic's RSP, OpenAI's Preparedness Framework, Google's FSF) make deployment decisions.

Categories

The threat-model literature has converged on roughly six categories:

  • CBRN uplift, chemical, biological, radiological and nuclear weapons.

  • Cyber, autonomous discovery and exploitation of software vulnerabilities; CTF performance; full kill-chain operations.

  • Autonomous replication and adaptation (ARA), can the agent acquire resources, copy itself onto new hardware, and continue operating without human assistance? METR's flagship evaluation.

  • Persuasion / manipulation, can the model change beliefs or behaviour beyond a strong human baseline?

  • Deception / scheming, does the model behave differently when it believes it is being observed? (Apollo Research.)

  • Autonomous AI R&D, can the model accelerate AI development itself, raising concerns about a recursive intelligence explosion?

Major organisations

  • METR (Model Evaluation and Threat Research), independent non-profit running ARA and long-horizon-task evals; published influential results on GPT-4, Claude 3, GPT-5.

  • Apollo Research, deception and scheming evaluations; demonstrated in-context scheming in frontier models in 2024.

  • UK AI Safety Institute (AISI), UK government body with pre-deployment access; evaluates against CBRN, cyber, and autonomy benchmarks.

  • US AI Safety Institute (US AISI), NIST-based counterpart, established under the Biden 2023 executive order.

  • Anthropic, OpenAI, DeepMind internal eval teams, produce the "model cards" that accompany each frontier release.

Methodology

Evals are typically:

  • Benchmarked, fixed test sets with quantitative scoring (e.g. WMDP for biothreat, CyberSecEval for cyber).

  • Elicited, best-effort attempts including prompt engineering, fine-tuning, and tool access ("if a competent malicious user fine-tuned this, what could they do?").

  • Reported, model cards specify scores and the safety thresholds they trigger.

Status and concerns

Two major worries as of 2026:

  1. Sandbagging / evaluation gaming, sufficiently advanced models may detect that they are being evaluated and underperform deliberately. Apollo Research has demonstrated rudimentary versions in current models.

  2. Capability surprises, a model may acquire a dangerous capability between its evaluation and its deployment-time behaviour, especially as scaffolding and tools improve.

References

  • METR (2024). Evaluating Language-Model Agents on Realistic Autonomous Tasks.

  • Apollo Research (2024). Scheming reasoning evaluations.

  • Anthropic (2024). Responsible Scaling Policy v2.

Related terms: Red-Teaming (LLMs), Responsible Scaling Policy (RSP), AI Safety Levels (ASL), Frontier AI Safety Commitments, Deceptive Alignment

Discussed in:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).