Related Studies on Watermarks in Large Language Models

I’m writing this blog post to summarize some representative papers on watermarks in large language models.

Table of Contents

- Introduction
- Methodologies
  - Rule-Based Watermarking
  - Inference-time Watermarking
  - Neural-based Watermarking
- Use Cases
- References

Introduction


Watermarks were first introduced in Fabriano, Italy, in 1282. At the time, watermarks were created by changing the thickness of paper during a stage in the manufacturing process when it was still wet.

(Wikipedia)


It would be great if large language models could add watermarks to the texts they generate. With watermarks on AI-generated text, a user could verify whether a segment of text was generated by a specific LLM, and the owner of the LLM could claim or deny responsibility for it.

Unlike with images, it is much harder to add watermarks to text without distorting the original meaning. One cannot simply insert a fixed segment of text as a watermark, since it could easily be removed. A watermark for an LLM should therefore be embedded in the text in a way that preserves its semantics, while still allowing a specific procedure to extract the watermark and verify that the text was generated by that LLM.

Methodologies

Approaches to adding watermarks to LLM-generated text can be methodologically categorized into three families: rule-based watermarking, inference-time watermarking, and neural-based watermarking [1].

Rule-Based Watermarking

This approach integrates watermarks into LLM-generated text by substituting certain words with their synonyms or performing syntactic transformations.

The idea is quite straightforward, hence somewhat “naive”. A motivation for studies using this method could be that a third party wants to insert its own watermark into LLM-generated text under a black-box scenario, i.e., without access to the model’s internals [2].

The advantage of this method is that it largely preserves semantics. The disadvantage is that it is less robust against watermark detection and removal attacks.
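To make the rule-based idea concrete, here is a minimal sketch of a keyed synonym-substitution watermark. The synonym table, secret key, and agreement score are all hypothetical placeholders; a real system would draw synonyms from a lexical resource such as WordNet and run a proper statistical test on the agreement rate.

```python
import hashlib

# Toy synonym table; illustrative only. A real rule-based system would
# use a lexical resource such as WordNet.
SYNONYMS = {
    "big": ["big", "large"],
    "quick": ["quick", "fast"],
    "show": ["show", "demonstrate"],
}
# Map every synonym back to its canonical head word.
HEAD_OF = {syn: head for head, syns in SYNONYMS.items() for syn in syns}

SECRET_KEY = b"watermark-key"  # assumed shared between embedder and verifier

def keyed_bit(head: str) -> int:
    """Derive a pseudorandom bit from the head word and the secret key."""
    return hashlib.sha256(SECRET_KEY + head.encode()).digest()[0] & 1

def embed(text: str) -> str:
    """Replace each substitutable word with the synonym selected by the keyed bit."""
    out = []
    for token in text.split():
        head = HEAD_OF.get(token.lower())
        out.append(SYNONYMS[head][keyed_bit(head)] if head else token)
    return " ".join(out)

def agreement(text: str) -> float:
    """Fraction of substitutable words matching the keyed choice:
    close to 1.0 for watermarked text, around 0.5 for natural text."""
    hits = total = 0
    for token in text.split():
        head = HEAD_OF.get(token.lower())
        if head:
            total += 1
            hits += int(SYNONYMS[head].index(token.lower()) == keyed_bit(head))
    return hits / total if total else 0.0

marked = embed("the quick test can show big gains")
print(marked, agreement(marked))  # watermarked text scores 1.0
```

Because the synonym choice is keyed, natural text agrees with the keyed choice only about half the time, while watermarked text agrees almost always; that gap is what a verifier tests for. The sketch also illustrates the weakness noted above: an attacker who paraphrases the text destroys the agreement.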

A representative work using this method is Robust Multi-bit Natural Language Watermarking through Invariant Features [3], published in 2023. The authors generate watermarks in two phases under a black-box setting. Given a segment of text (which they call the cover text), Phase 1 finds words that are ‘invariant’ to corruption, meaning an attacker would need to change a significant portion of the text to alter them. Phase 2 uses a pre-trained infill model to generate replacements for the words chosen in Phase 1. Extraction appears to work by carrying out the same process on the given text and checking whether the produced replacement words match the ones present (these words are expected to be somewhat optimal). The experiments were done on IMDB, WikiText, and two English novels, where insertion/deletion/substitution of 2.5%–5% of the words (within every N words) was treated as corruption.
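As a rough illustration, the following sketch mimics that two-phase pipeline with a BERT-style fill-mask model standing in for the infill model. The position-selection heuristic (longest words) and the rank-based bit encoding are illustrative assumptions of mine, not the paper’s actual invariance criterion or encoding scheme.

```python
from transformers import pipeline

# Pre-trained infill model; bert-base-uncased stands in for whatever
# infill model the paper actually uses.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def select_positions(tokens, k=2):
    """Phase 1 stand-in: pick the k longest words as 'invariant' anchors.
    The paper instead selects words shown to be robust to corruption."""
    ranked = sorted(range(len(tokens)), key=lambda i: len(tokens[i]), reverse=True)
    return sorted(ranked[:k])

def embed(tokens, bits):
    """Phase 2: replace each anchor with the infill candidate whose rank
    encodes one watermark bit (rank 0 -> bit 0, rank 1 -> bit 1)."""
    tokens = list(tokens)
    for pos, bit in zip(select_positions(tokens, k=len(bits)), bits):
        masked = tokens[:pos] + ["[MASK]"] + tokens[pos + 1:]
        candidates = unmasker(" ".join(masked), top_k=2)
        tokens[pos] = candidates[bit]["token_str"]
    return tokens

def extract(tokens, n_bits=2):
    """Extraction: rerun the same selection and infilling; the rank of the
    observed word among the candidates recovers each bit. Assumes the
    context is stable enough that the candidate ranking is reproduced."""
    bits = []
    for pos in select_positions(tokens, k=n_bits):
        masked = tokens[:pos] + ["[MASK]"] + tokens[pos + 1:]
        words = [c["token_str"] for c in unmasker(" ".join(masked), top_k=2)]
        bits.append(words.index(tokens[pos]) if tokens[pos] in words else None)
    return bits

cover = "the movie was a pleasant surprise from start to finish".split()
marked = embed(cover, bits=[1, 0])
print(marked, extract(marked))
```

One caveat the paper addresses but this sketch does not: extraction only works if Phase 1 selects the same positions on the watermarked (and possibly corrupted) text as it did on the cover text, which is exactly why the authors seek corruption-invariant features rather than a fragile heuristic like word length.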

Inference-time Watermarking

Neural-based Watermarking

Use Cases

References

  1. The Science of Detecting LLM-Generated Texts
  2. Watermarking Text Generated by Black-Box Language Models
  3. Robust Multi-bit Natural Language Watermarking through Invariant Features
Written on September 21, 2024