16.16 Deepfakes, watermarking, content provenance
Generative models make it cheap to fabricate realistic images, audio and video. The technical countermeasures fall into three groups.
Watermarking
Embed a signal in generated content that identifies it as AI-generated and survives common transformations. For images, Google's SynthID (DeepMind 2023) modifies pixel values in a way that is imperceptible to humans but detectable by a paired classifier; the modification persists through cropping, JPEG re-encoding, brightness/contrast changes and small rotations. The mathematical structure is a learned spread-spectrum signal: the watermark is a high-dimensional vector added to image features, and the detector is a classifier that scores against the known watermark direction.
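The spread-spectrum idea can be illustrated with a toy sketch. This is not SynthID's actual algorithm (which is learned end-to-end and operates on pixels); the block dimensions, embedding strength, and threshold below are all illustrative assumptions. The point is the structure: the watermark is a secret direction in feature space, embedding nudges features along it, and the detector scores the projection onto it.

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCKS, DIM = 256, 64  # illustrative: image features split into 256 blocks of 64 dims

# Secret watermark: one unit-norm direction per feature block.
watermark = rng.standard_normal((BLOCKS, DIM))
watermark /= np.linalg.norm(watermark, axis=1, keepdims=True)

def embed(features, strength=1.0):
    """Add a small component along the secret direction in every block."""
    return features + strength * watermark

def score(features):
    """Mean projection onto the watermark directions across blocks.
    Unmarked content scores near 0 (std ~ 1/sqrt(BLOCKS));
    marked content scores near the embedding strength."""
    return float(np.mean(np.einsum("bd,bd->b", features, watermark)))

def detect(features, threshold=0.5):
    return score(features) > threshold

clean = rng.standard_normal((BLOCKS, DIM))
marked = embed(clean)
```

Averaging over many blocks is what buys robustness: a crop or re-encoding perturbs individual blocks, but the mean projection degrades gracefully rather than vanishing.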
For text, the Kirchenbauer et al. (2023) "A Watermark for Large Language Models" is the canonical scheme. At each generation step, hash the previous token to seed a pseudo-random partition of the vocabulary into "green" and "red" lists; bias the logits to prefer green. A statistical test on a candidate text counts the green-token fraction and computes a $z$-score against the null of unbiased generation. The scheme is robust to small edits and detectable with a few hundred tokens, but defeated by paraphrasing.
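The detection side of the scheme can be sketched in a few lines. This is a simplified illustration, not the authors' reference implementation: the vocabulary size, green-list fraction, and the rejection-sampling stand-in for logit biasing are all assumptions made for the demo.

```python
import hashlib
import math
import random

VOCAB_SIZE = 50_000
GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def is_green(prev_token, token):
    """Pseudo-random partition keyed on the previous token: hash (prev, token)
    and put the pair on the green list with probability GAMMA."""
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < GAMMA

def z_score(tokens):
    """Count green tokens and test against the null of unbiased generation,
    where each token is green independently with probability GAMMA."""
    n = len(tokens) - 1
    greens = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

rng = random.Random(1)

# Stand-in for biased sampling: strongly prefer green continuations
# (a real generator would add a bias delta to green-token logits instead).
watermarked = [0]
for _ in range(300):
    while True:
        cand = rng.randrange(VOCAB_SIZE)
        if is_green(watermarked[-1], cand) or rng.random() < 0.2:
            break
    watermarked.append(cand)

# Null case: tokens drawn with no green bias.
unmarked = [rng.randrange(VOCAB_SIZE) for _ in range(301)]
```

On a few hundred tokens, the watermarked sequence yields a large positive $z$ while unbiased text stays within normal range, which is why the test needs no access to the model, only the hashing key.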
Watermarking has two open problems: the signal can be removed by an adversary willing to expend compute (Zhang, Edelman et al. 2023 showed adversarial paraphrasing breaks every published watermark), and watermarks from competing vendors are incompatible.
C2PA: content credentials
The Coalition for Content Provenance and Authenticity (Adobe, Microsoft, the BBC, Truepic, Sony, Nikon, Canon, and others) defined an open standard for cryptographically signed provenance metadata attached to media files. A camera signs the photo at capture; an editor signs each transformation; the chain of signatures travels with the file. C2PA addresses the authenticity question (who made this and how) rather than the generation question (was this AI-made), but the two compose: a C2PA signature can include "generated by model X".
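The chain-of-signatures structure can be sketched as follows. This is a toy model, not the C2PA format: real C2PA uses X.509 certificates and COSE signatures inside a JUMBF container, whereas here HMAC with a per-actor secret key is a stdlib placeholder for "this actor signed these bytes", and the claim fields are invented for illustration.

```python
import hashlib
import hmac
import json

def make_claim(content, action, prev_claim, key):
    """Append a provenance claim covering the current content hash and the
    previous claim's signature, then sign the whole claim."""
    body = {
        "action": action,  # e.g. "captured", "cropped"
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "prev_signature": prev_claim["signature"] if prev_claim else None,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return body

def verify_claim(content, claim, prev_claim, key):
    """Check the signature, the content hash, and the link to the prior claim."""
    body = {k: v for k, v in claim.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    ok_sig = hmac.compare_digest(
        claim["signature"], hmac.new(key, payload, hashlib.sha256).hexdigest())
    ok_hash = body["content_sha256"] == hashlib.sha256(content).hexdigest()
    ok_chain = body["prev_signature"] == (
        prev_claim["signature"] if prev_claim else None)
    return ok_sig and ok_hash and ok_chain

# Camera signs at capture; an editor signs the transformation.
camera_key, editor_key = b"camera-secret", b"editor-secret"
photo = b"raw sensor bytes"
c1 = make_claim(photo, "captured", None, camera_key)
edited = photo + b" (cropped)"
c2 = make_claim(edited, "cropped", c1, editor_key)
```

Because each claim commits to both the content hash and the previous signature, tampering with the file or with any earlier step in the chain invalidates verification from that point on.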
Adoption as of April 2026: most professional cameras (Sony α7-IV, Nikon Z9, Leica M11), Adobe Photoshop and Lightroom, OpenAI's DALL-E and ChatGPT image outputs, Microsoft Designer, the BBC's news pipeline. Browser support is partial; the Chromium "Content Credentials" badge is in beta.
Detection classifiers
A failed approach worth mentioning. From 2018 to 2023 there was significant work on classifiers that detect AI-generated content from the artefacts of the generation process. The field is now widely seen as a losing race: each new generator removes the artefacts the previous detector relied on, and cross-generator generalisation is poor. The 2023 OpenAI text-detection classifier was withdrawn after six months because of low accuracy and a high false-positive rate on non-native English writing.
The settled position in the C2PA community is that provenance (signed metadata) is more tractable than detection (forensic classification), and that the long-run answer is a signed-by-default capture pipeline, not a forensic classifier.