Glossary

Red-Teaming (LLMs)

Red-teaming in AI is the systematic, adversarial probing of a model, by humans, by automated tools, or by other models, to discover failure modes that ordinary evaluation would miss. The term is borrowed from cybersecurity, where a "red team" simulates an attacker against a defender ("blue team"). For frontier AI, red-teaming is now a near-universal component of pre-deployment safety practice and is required by the Bletchley and Seoul voluntary commitments.

Methods

Modern red-teaming combines:

  • Manual probing, domain experts (chemists, biologists, cyber-security researchers, ethicists) attempt to elicit harmful outputs from the model.

  • Crowdsourced red-teaming, external participants given bounty incentives.

  • Automated red-teaming, tools like GCG, AutoDAN, PAIR, and TAP generate adversarial prompts at scale.

  • Tournament play, public competitions (HackAPrompt, Anthropic's bug bounty, Apart Research) crowdsource jailbreaks.

  • Capability evaluations, structured tests of dangerous-capability uplift in CBRN, cyber, autonomy, persuasion.

Major teams

  • Anthropic Frontier Red Team, internal team focused on chemical, biological, radiological, nuclear (CBRN) and autonomous-replication threats.

  • OpenAI Red Team Network, paid external experts contracted before launch.

  • Google DeepMind Frontier Safety Team, runs the Frontier Safety Framework evaluations.

  • METR (Model Evaluation and Threat Research), independent organisation evaluating autonomy-related capabilities.

  • Apollo Research, focuses on deception and scheming evaluations.

  • UK AI Safety Institute (AISI) and US AISI, government bodies with pre-deployment access to frontier models since 2024.

Limitations

Red-teaming is necessary but not sufficient:

  • Coverage gap, what is not tested is not known.

  • Evaluation gaming, models trained against red-team examples may learn to recognise the evaluation context and behave differently in deployment (cf. sleeper agents).

  • Capability ceiling, red-teamers cannot test capabilities they themselves do not understand; testing for super-human persuasion or super-human bioweapon design is fundamentally hard.

  • Reproducibility, manual red-team results depend on the testers and are hard to compare across labs.

Status

All frontier labs publish (sometimes redacted) red-team summaries with model releases. Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework explicitly tie deployment decisions to red-team outcomes. Independent external red-teaming, especially via national AISIs, is increasingly the norm.

References

  • Perez et al. (2022). Red Teaming Language Models with Language Models.

  • Anthropic (2024). Frontier Red Team description.

  • UK AISI (2024). Pre-deployment evaluation reports.

Related terms: Evaluations / Capability Evaluations, Jailbreak, Responsible Scaling Policy (RSP), Frontier AI Safety Commitments, Bletchley AI Safety Summit

Discussed in:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).