Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, & Boaz Barak (2023)
Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models. arXiv:2311.04378.
URL: https://arxiv.org/abs/2311.04378
Abstract. A theoretical and empirical analysis of the limits of LLM watermarking. The authors prove that a "strong" watermark, one that no efficient adversary can remove without degrading output quality, is impossible whenever the attacker has a quality oracle (to check whether a candidate output is still a good response) and a perturbation oracle (to make small quality-preserving edits): a random walk over such perturbations provably washes the watermark out. They demonstrate the result empirically by removing published watermarks, including Kirchenbauer-style schemes, through automated quality-preserving paraphrasing at modest compute cost. The paper sharpened the policy debate around mandatory AI-content watermarking and the limits of detection-based governance approaches.
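The empirical attack can be summarized as a random walk over quality-preserving paraphrases. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' code: `perturb`, `quality`, and `detects_watermark` are hypothetical stand-ins for the paper's perturbation oracle, quality oracle, and the target watermark detector.

```python
import random


def erase_watermark(text, perturb, quality, detects_watermark,
                    min_quality=0.8, max_steps=200, candidates_per_step=4):
    """Random-walk removal attack: repeatedly paraphrase while quality stays high.

    `perturb(text) -> str`, `quality(text) -> float`, and
    `detects_watermark(text) -> bool` are illustrative placeholders, not APIs
    from the paper or any specific library.
    """
    current = text
    for _ in range(max_steps):
        if not detects_watermark(current):
            return current  # watermark signal gone, quality preserved throughout
        # Propose a few local paraphrases and step to one that keeps quality.
        proposals = [perturb(current) for _ in range(candidates_per_step)]
        acceptable = [p for p in proposals if quality(p) >= min_quality]
        if acceptable:
            current = random.choice(acceptable)
    return current  # best-effort output if the step budget is exhausted
```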
Tags: safety watermarking language-models