IACR Communications in Cryptology (CiC)

Publicly-Detectable Watermarking for Language Models

Authors

Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang
Jaiden Fairoze
University of California, Berkeley, USA
fairoze@berkeley.edu
Sanjam Garg
University of California, Berkeley, USA
sanjamg@berkeley.edu
Somesh Jha
University of Wisconsin–Madison, USA
jha@cs.wisc.edu
Saeed Mahloujifar
Fundamental Artificial Intelligence Research at Meta, USA
saeedm@meta.com
Mohammad Mahmoody
University of Virginia, USA
mohammad@virginia.edu
Mingyuan Wang
New York University Shanghai, China
mingyuan.wang@nyu.edu

Abstract

We present a publicly-detectable watermarking scheme for language models (LMs): the detection algorithm contains no secret information, so anyone can run it. We embed a publicly-verifiable cryptographic signature into LM output via rejection sampling and prove that the resulting text is unforgeable and distortion-free (i.e., undetectable without access to the public key). We use error-correcting codes to overcome periods of low entropy, a barrier for all prior watermarking schemes. We implement our scheme and find that our formal claims hold in practice.
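To illustrate the rejection-sampling idea behind the abstract, here is a minimal, hypothetical sketch (not the paper's construction): a toy "LM" samples uniformly random token blocks, and each payload bit — standing in for one bit of the publicly-verifiable signature — is embedded by resampling a block until a public hash of the context and block equals that bit. Detection needs no secret: anyone can recompute the hash bits. The names (`VOCAB`, `embed`, `extract`, the block length, and the hash-bit derivation) are all illustrative assumptions; the actual scheme signs with a real signature scheme (e.g., BLS) and adds error correction for low-entropy stretches.

```python
import hashlib
import random

# Toy "LM": samples tokens uniformly from a small vocabulary.
# A real scheme would sample from the model's conditional distribution.
VOCAB = [chr(c) for c in range(ord("a"), ord("z") + 1)]

def sample_block(rng, block_len=8):
    return "".join(rng.choice(VOCAB) for _ in range(block_len))

def hash_bit(prev_text, block):
    # One public pseudorandom bit derived from the context and candidate block.
    digest = hashlib.sha256((prev_text + "|" + block).encode()).digest()
    return digest[0] & 1

def embed(payload_bits, rng, max_tries=64):
    """Rejection sampling: resample each block until its hash bit
    equals the next payload bit (e.g., a bit of the signature)."""
    text = ""
    for bit in payload_bits:
        for _ in range(max_tries):
            block = sample_block(rng)
            if hash_bit(text, block) == bit:  # accept this block
                text += block
                break
        else:
            raise RuntimeError("low entropy: could not embed bit")
    return text

def extract(text, n_bits, block_len=8):
    """Public detection: recompute the hash bit of each block."""
    bits, prev = [], ""
    for i in range(n_bits):
        block = text[i * block_len:(i + 1) * block_len]
        bits.append(hash_bit(prev, block))
        prev += block
    return bits

rng = random.Random(0)
payload = [1, 0, 1, 1, 0, 0, 1, 0]  # stand-in for signature bits
wm_text = embed(payload, rng)
assert extract(wm_text, len(payload)) == payload
```

Since each candidate block's hash bit is unbiased, embedding one bit takes about two samples in expectation; when the model's output distribution has little entropy, rejection sampling can fail, which is the failure mode the paper's error-correction layer addresses.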

References

[Aar23]
Scott Aaronson. Neurocryptography. Invited Plenary Talk at Crypto'2023. 2023.
[AF21]
Sahar Abdelnabi and Mario Fritz. Adversarial watermarking transformer: Towards tracing text provenance with data hiding. In 2021 IEEE Symposium on Security and Privacy (SP). 2021. DOI: 10.1109/SP40001.2021.00083
[Ber16]
Daria Beresneva. Computer-generated text detection using machine learning: A systematic review. In Natural Language Processing and Information Systems: 21st International Conference on Applications of Natural Language to Information Systems. 2016. DOI: 10.1007/978-3-319-41754-7_43
[BLS01]
Dan Boneh, Ben Lynn, and Hovav Shacham. Short signatures from the Weil pairing. In International Conference on the Theory and Application of Cryptology and Information Security. 2001. DOI: 10.1007/3-540-45682-1_30
[BR93]
Mihir Bellare and Phillip Rogaway. Random oracles are practical: A paradigm for designing efficient protocols. In Proceedings of the 1st ACM Conference on Computer and Communications Security. 1993. DOI: 10.1145/168588.168596
[Cac98]
Christian Cachin. An information-theoretic model for steganography. In International Workshop on Information Hiding. 1998. DOI: 10.1007/3-540-49380-8_21
[CBZ+23]
Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. On the possibilities of AI-generated text detection. arXiv preprint arXiv:2304.04736, 2023. DOI: 10.48550/arXiv.2304.04736
[CGZ24]
Miranda Christ, Sam Gunn, and Or Zamir. Undetectable Watermarks for Language Models. In Proceedings of Thirty Seventh Conference on Learning Theory. 2024.
[CWJ+23]
Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. arXiv preprint arXiv:2304.00723, 2023. DOI: 10.48550/arXiv.2304.00723
[Fac23]
Hugging Face. Hugging Face Transformers. Accessed: 2023-10-08. https://huggingface.co/docs/transformers/index. 2023.
[GHJ+22]
Qian Guo, Clemens Hlauschek, Thomas Johansson, Norman Lahr, Alexander Nilsson, and Robin Leander Schröder. Don’t reject this: Key-recovery timing attacks due to rejection-sampling in HQC and BIKE. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022. DOI: 10.46586/tches.v2022.i3.223-263
[GM24]
Noah Golowich and Ankur Moitra. Edit distance robust watermarks for language models. arXiv preprint arXiv:2406.02633, 2024. DOI: 10.48550/arXiv.2406.02633
[Goo24]
Google DeepMind. Our next-generation model: Gemini 1.5. Accessed: 2024-02-20. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/. 2024.
[GPT23]
GPTZero. GPTZero | The Trusted AI Detector for ChatGPT, GPT-4, & More. Accessed: 2023-10-05. https://gptzero.me/. 2023.
[GSR19]
Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019. DOI: 10.48550/arXiv.1906.04043
[GW17]
Venkatesan Guruswami and Carol Wang. Deletion codes in the high-noise and high-rate regimes. IEEE Transactions on Information Theory, 2017. DOI: 10.1109/TIT.2017.2659765
[HAAL23]
Jan Hendrik Kirchner, Lama Ahmad, Scott Aaronson, and Jan Leike. New AI classifier for indicating AI-written text. Accessed: 2023-10-05. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text. 2023.
[HBD+19]
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019. DOI: 10.48550/arXiv.1904.09751
[HLV02]
Nicholas J Hopper, John Langford, and Luis Von Ahn. Provably secure steganography. In Advances in Cryptology—CRYPTO 2002: 22nd Annual International Cryptology Conference. 2002. DOI: 10.1007/3-540-45708-9_6
[Hoe94]
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding, 1994. DOI: 10.1007/978-1-4612-0865-5_26
[JAL20]
Ganesh Jawahar, Muhammad Abdul-Mageed, and Laks VS Lakshmanan. Automatic detection of machine generated text: A critical survey. arXiv preprint arXiv:2011.01314, 2020. DOI: 10.48550/arXiv.2011.01314
[JSM+23]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, and others. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023. DOI: 10.48550/arXiv.2310.06825
[KGW+23]
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In Proceedings of the 40th International Conference on Machine Learning. 2023.
[KJGR21]
Gabriel Kaptchuk, Tushar M Jois, Matthew Green, and Aviel D Rubin. Meteor: Cryptographically secure steganography for realistic distributions. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021. DOI: 10.1145/3460120.3484550
[KSK+23]
Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36, 2023.
[KTHL24]
Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. Transactions on Machine Learning Research, 2024.
[LPH+23]
Aiwei Liu, Leyi Pan, Xuming Hu, Shu'ang Li, Lijie Wen, Irwin King, and Philip S Yu. A private watermark for large language models. arXiv preprint arXiv:2307.16230, 2023. DOI: 10.48550/arXiv.2307.16230
[LUY08]
Thomas Lavergne, Tanguy Urvoy, and François Yvon. Detecting fake content with relative entropy scoring. Proceedings of the 2008 International Conference on Uncovering Plagiarism, Authorship and Social Software Misuse, 2008.
[MLK+23]
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. DetectGPT: zero-shot machine-generated text detection using probability curvature. Proceedings of the 40th International Conference on Machine Learning, 2023.
[MTDZ24]
Travis Munyer, Abdullah All Tanvir, Arjon Das, and Xin Zhong. DeepTextMark: A Deep learning-driven text watermarking approach for identifying large language model generated text. IEEE Access, 2024. DOI: 10.1109/ACCESS.2024.3376693
[Ope23]
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. DOI: 10.48550/arXiv.2303.08774
[PGM+19]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and others. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019.
[PSF+23]
Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. Mark my words: Analyzing and evaluating language model watermarks. arXiv preprint arXiv:2312.00273, 2023. DOI: 10.48550/arXiv.2312.00273
[QYH+24]
Wenjie Qu, Dong Yin, Zixin He, Wei Zou, Tianyang Tao, Jinyuan Jia, and Jiaheng Zhang. Provably robust multi-bit watermarking for AI-generated text via error correction code. arXiv preprint arXiv:2401.16820, 2024. DOI: 10.48550/arXiv.2401.16820
[QZL+23]
Jipeng Qiang, Shiyu Zhu, Yun Li, Yi Zhu, Yunhao Yuan, and Xindong Wu. Natural language watermarking via paraphraser-based lexical substitution. Artificial Intelligence, 2023. DOI: 10.1016/j.artint.2023.103859
[RBL+22]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022.
[RSR+20]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020.
[SC05]
Qibin Sun and Shih-Fu Chang. A secure and robust digital signature scheme for JPEG2000 image authentication. IEEE Transactions on Multimedia, 2005. DOI: 10.1109/TMM.2005.846776
[SKB+23]
Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can AI-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023. DOI: 10.48550/arXiv.2303.11156
[TMS+23]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and others. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. DOI: 10.48550/arXiv.2307.09288
[WKR+19]
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. In International Conference on Learning Representations. 2019.
[WM01]
Ping Wah Wong and Nasir Memon. Secret and public key image watermarking schemes for image authentication and ownership verification. IEEE Transactions on Image Processing, 2001. DOI: 10.1109/83.951543
[YAJK23]
KiYoon Yoo, Wonhyuk Ahn, Jiho Jang, and Nojun Kwak. Robust natural language watermarking through invariant features. arXiv preprint arXiv:2305.01904, 2023. DOI: 10.48550/arXiv.2305.01904
[ZEF+24]
Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: Impossibility of strong watermarking for language models. In Proceedings of the 41st International Conference on Machine Learning. 2024.
[ZHR+19]
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. Advances in Neural Information Processing Systems, 2019.
[ZRG+22]
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, and others. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. DOI: 10.48550/arXiv.2205.01068
[ZWQ21]
Zhongxiang Zheng, Anyu Wang, and Lingyue Qin. Rejection sampling revisit: how to choose parameters in lattice-based signature. Mathematical Problems in Engineering, 2021. DOI: 10.1155/2021/9948618

Open access

History
Submitted: 2024-10-09
Accepted: 2024-12-03
Published: 2025-01-13
How to cite

Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, and Mingyuan Wang, "Publicly-Detectable Watermarking for Language Models," IACR Communications in Cryptology, vol. 1, no. 4, Jan 13, 2025, doi: 10.62056/ahmpdkp10.

License

Copyright is held by the author(s)

This work is licensed under a Creative Commons Attribution (CC BY) license.