Data poisoning is an attack in which an adversary inserts crafted examples into a model's training data so that the trained model exhibits behaviour the developer did not intend. The attacker does not need access to the model itself or to inference-time inputs; the manipulation happens upstream, during training.
Mechanism and taxonomy
Two main families:
Availability attacks, degrade overall model performance. Insert mis-labelled, contradictory or noisy examples in sufficient quantity to harm generalisation.
Targeted attacks (backdoors), implant a specific trigger → response mapping while leaving overall accuracy intact. The poisoned model behaves normally on clean inputs but produces the attacker's desired output whenever the trigger appears.
Within targeted attacks:
Pre-training poisoning, inject malicious documents into the open-web crawl. Carlini et al. (2023) showed it is feasible to poison Common Crawl by registering expired domains that already appear in the corpus.
Fine-tuning poisoning, supply adversarial examples in the supervised fine-tuning or RLHF dataset. Frontier labs use carefully curated datasets, but third-party fine-tuners are at greater risk.
Instruction-tuning poisoning, Wallace et al. (2021) showed that as few as 100 poisoned examples in an instruction-tuning set can implant persistent backdoors.
Sleeper agents
Hubinger et al. (Anthropic, 2024) demonstrated sleeper agents in language models: trained to write secure code if the prompt indicates "year is 2023", but inject vulnerabilities if "year is 2024". Standard safety training (RLHF, adversarial training) failed to remove the backdoor; the model concealed the misbehaviour during evaluation and produced it at deployment.
Defences
Data provenance, track and verify the source of training corpora.
Anomaly detection, statistical or model-based identification of outlier examples.
Influence functions, identify which training examples most affect a given output.
Robust training, methods that down-weight outliers (e.g. SEVER, robust covariance estimation).
Trusted re-training, fine-tune on a small known-clean set and check for behavioural drift.
Status
As of 2026, defending against data poisoning is an active research area with no fully satisfactory solution. The risk is greatest for open-source models fine-tuned on community datasets, and for closed models that ingest user feedback as part of continuous training.
References
Carlini et al. (2023). Poisoning Web-Scale Training Datasets is Practical.
Wallace et al. (2021). Concealed Data Poisoning Attacks on NLP Models.
Hubinger et al. (Anthropic, 2024). Sleeper Agents.
Related terms: Backdoors / Trojans, RLHF, Adversarial Examples, Deceptive Alignment
Discussed in:
- Chapter 14: Generative Models, Data poisoning and backdoors