Red-Teaming (LLMs), Glossary, Textbook of AI

Red-teaming in AI is the systematic, adversarial probing of a model, by humans, by automated tools, or by other models, to discover failure modes that ordinary evaluation would miss. The term is borrowed from cybersecurity, where a "red team" simulates an attacker against a defender ("blue team"). For frontier AI, red-teaming is now a near-universal component of pre-deployment safety practice and is required by the Bletchley and Seoul voluntary commitments.

Methods

Modern red-teaming combines:

Manual probing, domain experts (chemists, biologists, cyber-security researchers, ethicists) attempt to elicit harmful outputs from the model.
Crowdsourced red-teaming, external participants given bounty incentives.
Automated red-teaming, tools like GCG, AutoDAN, PAIR, and TAP generate adversarial prompts at scale.
Tournament play, public competitions (HackAPrompt, Anthropic's bug bounty, Apart Research) crowdsource jailbreaks.
Capability evaluations, structured tests of dangerous-capability uplift in CBRN, cyber, autonomy, persuasion.

Major teams

Anthropic Frontier Red Team, internal team focused on chemical, biological, radiological, nuclear (CBRN) and autonomous-replication threats.
OpenAI Red Team Network, paid external experts contracted before launch.
Google DeepMind Frontier Safety Team, runs the Frontier Safety Framework evaluations.
METR (Model Evaluation and Threat Research), independent organisation evaluating autonomy-related capabilities.
Apollo Research, focuses on deception and scheming evaluations.
UK AI Safety Institute (AISI) and US AISI, government bodies with pre-deployment access to frontier models since 2024.

Limitations

Red-teaming is necessary but not sufficient:

Coverage gap, what is not tested is not known.
Evaluation gaming, models trained against red-team examples may learn to recognise the evaluation context and behave differently in deployment (cf. sleeper agents).
Capability ceiling, red-teamers cannot test capabilities they themselves do not understand; testing for super-human persuasion or super-human bioweapon design is fundamentally hard.
Reproducibility, manual red-team results depend on the testers and are hard to compare across labs.

Status

All frontier labs publish (sometimes redacted) red-team summaries with model releases. Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework explicitly tie deployment decisions to red-team outcomes. Independent external red-teaming, especially via national AISIs, is increasingly the norm.

References

Perez et al. (2022). Red Teaming Language Models with Language Models.
Anthropic (2024). Frontier Red Team description.
UK AISI (2024). Pre-deployment evaluation reports.

Discussed in:

Chapter 14: Generative Models, Red-teaming

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).