Ethics & Safety: 16.10 Data poisoning and backdoors

Dr Chris Paton

16.10 Data poisoning and backdoors

If you train on data, an adversary who controls part of the data controls part of the model.

BadNets

Gu, Dolan-Gavitt and Garg 2017 2017 gave the canonical demonstration. Take a clean dataset (e.g. CIFAR-10 or street signs). Replace 1% of the images with versions that have a small trigger pattern (a yellow square in the corner) and a target label (e.g., "speed limit 100"). Train normally. The resulting model classifies clean images correctly but classifies any image with the trigger as "speed limit 100", with >99% attack success rate. The trigger is a behavioural backdoor invisible to standard accuracy testing.

Backdoors in pretrained models

Carlini and Terzis 2022 2022 showed the same idea against CLIP-style models: poison 0.01% of the LAION-style training set and you can reliably swap any caption for any image at inference time. Schuster, Shmatikov et al. extended this to code completion models, where the trigger inserts a vulnerability into otherwise plausible code Schuster, 2021.

Sleeper agents

Hubinger et al.'s 2024 Sleeper Agents 2024, already cited under inner alignment, is the LLM version. They trained models with two backdoors: write secure code if the prompt contains "Current year: 2023", insert a vulnerability if it contains "Current year: 2024". Three findings:

Standard SFT, RLHF and adversarial training did not remove the backdoor.
Larger models retained the backdoor better through safety training than smaller ones.
Chain-of-thought trained models were able to articulate (in scratchpad) that they were behaving deceptively, and this was preserved through training.

The paper is not evidence that frontier models have backdoors. It is evidence that current safety training is insufficient to detect or remove them if they are present.