IACR Communications in Cryptology (CiC)

Publicly-Detectable Watermarking for Language Models

Authors

Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang
Jaiden Fairoze
University of California, Berkeley, USA
fairoze@berkeley.edu
Sanjam Garg
University of California, Berkeley, USA
sanjamg@berkeley.edu
Somesh Jha
University of Wisconsin–Madison, USA
jha@cs.wisc.edu
Saeed Mahloujifar
Fundamental Artificial Intelligence Research at Meta, USA
saeedm@meta.com
Mohammad Mahmoody
University of Virginia, USA
mohammad@virginia.edu
Mingyuan Wang
New York University Shanghai, China
mingyuan.wang@nyu.edu

Abstract

We present a publicly-detectable watermarking scheme for language models (LMs): the detection algorithm contains no secret information, so anyone can run it. We embed a publicly-verifiable cryptographic signature into LM output via rejection sampling and prove that the resulting text is unforgeable and distortion-free (i.e., undetectable without access to the public key). We use error-correcting codes to overcome periods of low entropy, a barrier for all prior watermarking schemes. We implement our scheme and find that our formal claims hold in practice.
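To illustrate the rejection-sampling idea behind the abstract, here is a minimal, hypothetical sketch (not the paper's construction): a toy "LM" samples uniformly random token blocks, and each payload bit — standing in for one bit of the publicly-verifiable signature — is embedded by resampling a block until a public hash of the context and block equals that bit. Detection needs no secret: anyone can recompute the hash bits. The names (`VOCAB`, `embed`, `extract`, the block length, and the hash-bit derivation) are all illustrative assumptions; the actual scheme signs with a real signature scheme (e.g., BLS) and adds error correction for low-entropy stretches.

```python
import hashlib
import random

# Toy "LM": samples tokens uniformly from a small vocabulary.
# A real scheme would sample from the model's conditional distribution.
VOCAB = [chr(c) for c in range(ord("a"), ord("z") + 1)]

def sample_block(rng, block_len=8):
    return "".join(rng.choice(VOCAB) for _ in range(block_len))

def hash_bit(prev_text, block):
    # One public pseudorandom bit derived from the context and candidate block.
    digest = hashlib.sha256((prev_text + "|" + block).encode()).digest()
    return digest[0] & 1

def embed(payload_bits, rng, max_tries=64):
    """Rejection sampling: resample each block until its hash bit
    equals the next payload bit (e.g., a bit of the signature)."""
    text = ""
    for bit in payload_bits:
        for _ in range(max_tries):
            block = sample_block(rng)
            if hash_bit(text, block) == bit:  # accept this block
                text += block
                break
        else:
            raise RuntimeError("low entropy: could not embed bit")
    return text

def extract(text, n_bits, block_len=8):
    """Public detection: recompute the hash bit of each block."""
    bits, prev = [], ""
    for i in range(n_bits):
        block = text[i * block_len:(i + 1) * block_len]
        bits.append(hash_bit(prev, block))
        prev += block
    return bits

rng = random.Random(0)
payload = [1, 0, 1, 1, 0, 0, 1, 0]  # stand-in for signature bits
wm_text = embed(payload, rng)
assert extract(wm_text, len(payload)) == payload
```

Since each candidate block's hash bit is unbiased, embedding one bit takes about two samples in expectation; when the model's output distribution has little entropy, rejection sampling can fail, which is the failure mode the paper's error-correction layer addresses.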

References

[Aar23]
Scott Aaronson. Neurocryptography. Invited Plenary Talk at Crypto'2023. 2023.
[AF21]
Sahar Abdelnabi and Mario Fritz. Adversarial watermarking transformer: Towards tracing text provenance with data hiding. In 2021 IEEE Symposium on Security and Privacy (SP). 2021. DOI: 10.1109/SP40001.2021.00083
[Ber16]
Daria Beresneva. Computer-generated text detection using machine learning: A systematic review. In Natural Language Processing and Information Systems: 21st International Conference on Applications of Natural Language to Information Systems. 2016. DOI: 10.1007/978-3-319-41754-7_43
[BLS01]
Dan Boneh, Ben Lynn, and Hovav Shacham. Short signatures from the Weil pairing. In International Conference on the Theory and Application of Cryptology and Information Security. 2001. DOI: 10.1007/3-540-45682-1_30
[BR93]
Mihir Bellare and Phillip Rogaway. Random oracles are practical: A paradigm for designing efficient protocols. In Proceedings of the 1st ACM Conference on Computer and Communications Security. 1993. DOI: 10.1145/168588.168596
[Cac98]
Christian Cachin. An information-theoretic model for steganography. In International Workshop on Information Hiding. 1998. DOI: 10.1007/3-540-49380-8_21
[CBZ+23]
Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. On the possibilities of AI-generated text detection. arXiv preprint arXiv:2304.04736, 2023. DOI: 10.48550/arXiv.2304.04736
[CGZ24]
Miranda Christ, Sam Gunn, and Or Zamir. Undetectable Watermarks for Language Models. In Proceedings of Thirty Seventh Conference on Learning Theory. 2024.
[CWJ+23]
Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. arXiv preprint arXiv:2304.00723, 2023. DOI: 10.48550/arXiv.2304.00723
[Fac23]
Hugging Face. Hugging Face Transformers. Accessed: 2023-10-08. https://huggingface.co/docs/transformers/index. 2023.
[GHJ+22]
Qian Guo, Clemens Hlauschek, Thomas Johansson, Norman Lahr, Alexander Nilsson, and Robin Leander Schröder. Don’t reject this: Key-recovery timing attacks due to rejection-sampling in HQC and BIKE. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022. DOI: 10.46586/tches.v2022.i3.223-263
[GM24]
Noah Golowich and Ankur Moitra. Edit distance robust watermarks for language models. arXiv preprint arXiv:2406.02633, 2024. DOI: 10.48550/arXiv.2406.02633
[Goo24]
Google DeepMind. Our next-generation model: Gemini 1.5. Accessed: 2024-02-20. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/. 2024.
[GPT23]
GPTZero. GPTZero | The Trusted AI Detector for ChatGPT, GPT-4, & More. Accessed: 2023-10-05. https://gptzero.me/. 2023.
[GSR19]
Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019. DOI: 10.48550/arXiv.1906.04043
[GW17]
Venkatesan Guruswami and Carol Wang. Deletion codes in the high-noise and high-rate regimes. IEEE Transactions on Information Theory, 2017. DOI: 10.1109/TIT.2017.2659765
[HAAL23]
Jan Hendrik Kirchner, Lama Ahmad, Scott Aaronson, and Jan Leike. New AI classifier for indicating AI-written text. Accessed: 2023-10-05. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text. 2023.
[HBD+19]
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019. DOI: 10.48550/arXiv.1904.09751
[HLV02]
Nicholas J Hopper, John Langford, and Luis Von Ahn. Provably secure steganography. In Advances in Cryptology—CRYPTO 2002: 22nd Annual International Cryptology Conference. 2002. DOI: 10.1007/3-540-45708-9_6
[Hoe94]
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding, 1994. DOI: 10.1007/978-1-4612-0865-5_26
[JAL20]
Ganesh Jawahar, Muhammad Abdul-Mageed, and Laks VS Lakshmanan. Automatic detection of machine generated text: A critical survey. arXiv preprint arXiv:2011.01314, 2020. DOI: 10.48550/arXiv.2011.01314
[JSM+23]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, and others. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023. DOI: 10.48550/arXiv.2310.06825
[KGW+23]
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In Proceedings of the 40th International Conference on Machine Learning. 2023.
[KJGR21]
Gabriel Kaptchuk, Tushar M Jois, Matthew Green, and Aviel D Rubin. Meteor: Cryptographically secure steganography for realistic distributions. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021. DOI: 10.1145/3460120.3484550
[KSK+23]
Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36, 2023.
[KTHL24]
Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. Robust distortion-free watermarks for language models. Transactions on Machine Learning Research, 2024.
[LPH+23]
Aiwei Liu, Leyi Pan, Xuming Hu, Shu'ang Li, Lijie Wen, Irwin King, and Philip S Yu. A private watermark for large language models. arXiv preprint arXiv:2307.16230, 2023. DOI: 10.48550/arXiv.2307.16230
[LUY08]
Thomas Lavergne, Tanguy Urvoy, and François Yvon. Detecting fake content with relative entropy scoring. Proceedings of the 2008 International Conference on Uncovering Plagiarism, Authorship and Social Software Misuse, 2008.
[MLK+23]
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. DetectGPT: zero-shot machine-generated text detection using probability curvature. Proceedings of the 40th International Conference on Machine Learning, 2023.
[MTDZ24]
Travis Munyer, Abdullah All Tanvir, Arjon Das, and Xin Zhong. DeepTextMark: A Deep learning-driven text watermarking approach for identifying large language model generated text. IEEE Access, 2024. DOI: 10.1109/ACCESS.2024.3376693
[Ope23]
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. DOI: 10.48550/arXiv.2303.08774
[PGM+19]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and others. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019.
[PSF+23]
Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. Mark my words: Analyzing and evaluating language model watermarks. arXiv preprint arXiv:2312.00273, 2023. DOI: 10.48550/arXiv.2312.00273
[QYH+24]
Wenjie Qu, Dong Yin, Zixin He, Wei Zou, Tianyang Tao, Jinyuan Jia, and Jiaheng Zhang. Provably robust multi-bit watermarking for AI-generated text via error correction code. arXiv preprint arXiv:2401.16820, 2024. DOI: 10.48550/arXiv.2401.16820
[QZL+23]
Jipeng Qiang, Shiyu Zhu, Yun Li, Yi Zhu, Yunhao Yuan, and Xindong Wu. Natural language watermarking via paraphraser-based lexical substitution. Artificial Intelligence, 2023. DOI: 10.1016/j.artint.2023.103859
[RBL+22]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022.
[RSR+20]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020.
[SC05]
Qibin Sun and Shih-Fu Chang. A secure and robust digital signature scheme for JPEG2000 image authentication. IEEE Transactions on Multimedia, 2005. DOI: 10.1109/TMM.2005.846776
[SKB+23]
Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can AI-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023. DOI: 10.48550/arXiv.2303.11156
[TMS+23]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and others. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. DOI: 10.48550/arXiv.2307.09288
[WKR+19]
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. In International Conference on Learning Representations. 2019.
[WM01]
Ping Wah Wong and Nasir Memon. Secret and public key image watermarking schemes for image authentication and ownership verification. IEEE Transactions on Image Processing, 2001. DOI: 10.1109/83.951543
[YAJK23]
KiYoon Yoo, Wonhyuk Ahn, Jiho Jang, and Nojun Kwak. Robust natural language watermarking through invariant features. arXiv preprint arXiv:2305.01904, 2023. DOI: 10.48550/arXiv.2305.01904
[ZEF+24]
Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: Impossibility of strong watermarking for language models. In Proceedings of the 41st International Conference on Machine Learning. 2024.
[ZHR+19]
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. Advances in Neural Information Processing Systems, 2019.
[ZRG+22]
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, and others. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. DOI: 10.48550/arXiv.2205.01068
[ZWQ21]
Zhongxiang Zheng, Anyu Wang, and Lingyue Qin. Rejection sampling revisit: how to choose parameters in lattice-based signature. Mathematical Problems in Engineering, 2021. DOI: 10.1155/2021/9948618

Open access

History
Submitted: 2024-10-09
Accepted: 2024-12-03
Published: 2025-01-13
How to cite

Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, and Mingyuan Wang, "Publicly-Detectable Watermarking for Language Models," IACR Communications in Cryptology, vol. 1, no. 4, Jan 13, 2025, doi: 10.62056/ahmpdkp10.

License

Copyright is held by the author(s)

This work is licensed under a Creative Commons Attribution (CC BY) license.