Glossary

Adversarial Training (LLMs)

Adversarial training in the context of LLMs is the practice of explicitly including adversarial examples, known jailbreaks, prompt injections, and harmful prompts, in the training mixture, with correct (refusing or safe) responses as targets, in order to harden the model against future attacks.

The technique has a long pedigree in computer vision (Madry et al. 2018, PGD: train against worst-case perturbations within an epsilon-ball) and was adapted to language models as the jailbreak literature matured in 2023–2024.

Pipeline

A typical industrial pipeline:

  1. Red team generation, internal red teams plus automated tools (GCG, AutoDAN, PAIR, MSJ) produce adversarial prompts.

  2. Curation, prompts that successfully jailbreak the current model are kept; safe responses are written by humans or generated and screened.

  3. Training, the adversarial set is mixed into the supervised fine-tuning and RLHF data.

  4. Evaluation, the new model is re-tested against the same and novel adversarial sets.

  5. Iteration, the loop repeats; new attacks are added as they emerge.

Trade-offs

Adversarial training is one of the few techniques with a measurable safety effect, but it has well-known limitations:

  • Overrefusal, heavily adversarially-trained models refuse benign requests too readily ("How do I kill a process in Linux?" misclassified as harmful).

  • Distribution shift, the model is robust to attacks that look like training adversarials but may fail on genuinely novel attacks.

  • Capability cost, there is some evidence that aggressive safety training degrades general performance ("alignment tax").

Anthropic's constitutional AI and RL from AI feedback (RLAIF) can be seen as forms of large-scale adversarial training in which a critic model generates the adversarial pressure.

Status

As of 2026, all frontier labs use adversarial training as a core component of their safety stack. Public benchmarks (HarmBench, AdvBench, JailbreakBench) track progress; aggregate jailbreak success rates against frontier models have fallen from >80% in 2023 to <10% on common attacks in 2026, though adaptive attackers continue to find new vectors.

References

  • Madry et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.

  • Bai et al. (Anthropic, 2022). Constitutional AI.

  • Mazeika et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming.

Related terms: Jailbreak, GCG Attack, Constitutional AI, RLHF, Red-Teaming (LLMs), PGD Attack

Discussed in:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).