On November 28, 2023, Stability AI released SDXL Turbo, an advanced AI image generation model. Just two days later, IMATAG made available a robust and invisible watermarking solution on 🤗 Hugging Face to identify images generated by this AI. In this demo, users can generate an image by providing a prompt describing the desired image. An imperceptible watermark is inserted during the image generation process, certifying it as synthetic. The demo also allows for watermark detection after subjecting the generated image to various attacks, such as compression or recolorization.
Traditionally, watermarking occurred after content creation. However, with generative AI, a more suitable approach is to introduce watermarking during the generation process. The technique used in the demo is based on the refinement of a process developed by scientists at INRIA and Meta, known as StableSignature. It involves modifying the weights of the generative AI model (in this case, SDXL Turbo) to make it naturally generate watermarked images. As such, it comes at no additional cost in the generative process, and is harder to remove from the processing pipeline. For enhanced robustness, IMATAG combined this method with its in-house zero-bit decoder (indicating the presence or absence of a watermark, without payload), capable of decoding even highly altered watermarks.
Also, by adapting the method to 🤗 Hugging Face's 🧨 diffusers library, IMATAG made it possible to add robust and invisible watermarking to the generation pipeline in just 5 lines of code!
from diffusers.models import AutoencoderKL
from diffusers import StableDiffusionXLPipeline
model = "stabilityai/sdxl-turbo"
vae = AutoencoderKL.from_pretrained("imatag/stable-signature-bzh-sdxl-vae-medium")
pipe = StableDiffusionXLPipeline.from_pretrained(model, vae=vae)
... and here is how we did it!
The general idea behind Stable Signature is to fine-tune the VAE decoder of Stable Diffusion to make it directly produce the specific watermark signal expected by a differentiable detector for a fixed key. The principle is very similar to a targeted adversarial attack, except we want to boost the response of the watermark detector for a fixed key rather than boosting a classifier's response on a target class. Also, instead of directly modifying the pixels of the image as in SSL watermarking, the gradient information is back-propagated one stage further into the weights of the VAE decoder (Fig. 2 (b) of Meta's blog article).
Let's stop there for a second and think about the implications. First of all, the watermark is merged in the weights of the VAE decoder, making it hard to remove unless the original weights were made public (which unfortunately is the case for Stable Diffusion). It also means the watermark comes at no additional computational cost compared to a non-watermarked generation. However, the VAE decoder structure adds some constraints on the watermark. One is that it tiles the output image, meaning the coarse features of the watermark are not reproduced. Also, adjusting the output size of the generation will change the number of patches the detector can rely on, but not their scale, which is fixed. This means overall the watermark is naturally robust to crop but not to scale: the original paper shows that even only 10% of the original pixels are enough to find the watermark with strong confidence, but "The resize and JPEG 50 transformations seems to be the most challenging ones, and sometimes get bellow 0.9 [bit accuracy]" as it relies only on the pretraining of the watermark detector. Another point is that the perceptibility of the watermark depends on how much the initial model was fine-tuned, and may be hard to control. Instead of a fixed PSNR/SSIM budget, the strength of the watermark is controlled by the learning rate and lambda of the loss function used during fine-tuning. Finally, even if the detector is kept the same, the fine-tuning procedure needs to be done from stratch for each secret key to detect.
IMATAG's BZH (Blind Zero-bit Hidding) watermarking system is derived from HiDDeN, like the detector used in Stable Signature. However, instead of decoding a binary message, we rely on zero-bit watermarking and extract a high-dimensional vector that we correlate with the vector generated from the key. For a random key, this expected vector is uniformly drawn on the surface of the unit hypersphere in dimension . Detecting the watermark on a query content then amounts to computing the probability that by chance, when running the detector on a non-watermarked image, we obtain a higher correlation than the one observed on the query content. This is called the p-value of rejecting the null hypothesis (H0) and leads to the probability of claiming wrongly that a content is watermarked while it's not (false positive), for a given threshold. It turns out it can be computed analytically from the area of a hyperspherical cap by evaluating the regularized incomplete beta function :
However, since the key is fixed in this use case, we need to be careful on how the output of the detector behaves when we run it on a dataset of random unwatermarked images. Indeed, as noted with the binary output of Stable Signature, in this case the output of the detector may be correlated and uncentered, preventing computation of the p-value with the formula above as it is not distributed uniformly on the hypersphere. However by learning a whitening transform of the output on a dataset of images, one can restore this assumption to a high level of confidence. Here we used Flickr100k to train this linear transform and perform ZCA whitening. Under H0 the p-value should be uniformly distributed. We tested this hypothesis on 5000 AI-generated images by visualizing the histogram of the p-value, which should be flat, and performing a Kolmogorov–Smirnov test to check if we could reject this null hypothesis:
Other improvements compared to HiDDeN include retraining for crop/resize/JPEG but also recapture attacks, better aggregation procedure, strict control on watermark MSE, support for masked input in the detector, etc…
We changed a few things in the Stable Signature procedure to adapt it to our needs. First of all, since our detector is zero-bit, the loss function to optimize was changed from binary cross-entropy between the expected message and predicted message to negative cosine similarity between the expected vector and the extracted vector , which are both normalized. So, using the same notations as the paper,
Then, rather than using a perceptual loss (LPIPS) to constrain the decoded patches to be similar to the non-watermarked patches, we reverted to the initial training loss of the KL-autoencoder (Stable Diffusion paper, appendix G, equation (25)) which is only interested in reconstructing the input image in a plausible way:
We believe this is a better objective as the non-watermarked model is not supposed to be released, giving more freedom to how we can generate patches. Also, reintroducing the discriminator term to ensure the distribution of decoded patches is hard to distinguish from the distribution of the original patches feels important, especially at high distorsion rates. Overall we want to use the degrees of freedom of the decoder to reconstruct plausible patches while including the watermark, rather than fitting to a preexisting model that is not supposed to be available anyway. Unfortunately, the discriminator learned during the KL-autoencoder training of Stable Diffusion is part of the loss and was not released (to our knowledge), so it has to be relearned from scratch during the fine-tuning procedure. We finetuned multiple models for various compromises between perceptibility and robustness by simply varying the parameter of the final loss:
Finally, we preprocess the image to detect with a fixed aspect-ratio-preserving rescale to a rectangle of 256-pixel side. This corresponds to the settings on which BZH was trained, and allows the watermark to be naturally robust to downscale. However it means the VAE decoder is trained for a specific generation resolution and has to be retrained if output size changes. We think this is ok as most generative models are working at fixed size (512px, 768px, 1024px) and generating at a different resolution may produce artifacts.
We compared this watermarking solution to a few others in terms of robustness and perceptibility. We generated ~5000 images with SDXL-turbo using COCO 2017 validation captions as prompts. As a baseline we used DCTDWT from the invisible-watermark package, since it is the default watermark used in the StableDiffusion original code and pipeline of 🤗 HuggingFace's 🧨 diffusers library. We compared also to watermarking after generation with our internal BZH watermarker (the one corresponding to the detector we finetune for), and IMATAG's production watermark (lamark). The solutions are evaluated by computing the decoder p-values and plotting the corresponding ROC curves. The false-positive rate axis uses logarithmic scale to better assess the performance at very low rates, which is generally the regime we are interested in. P-value thresholds of 1e-12 (almost no errors per test) or 1e-3 (generally used in the literature) are represented with light gray vertical lines. StableSignature variants are represented with solid lines while post-watermarking solutions are presented with dashed lines. Here's what we get with no attacks:
Among all the watermarks, DCTDWT (operating at average PSNR 42.6dB) and the weak model are the worst. They are still better than "none" which corresponds to random detection. The p-value for DCTDWT is computed by counting the number of matching bits (m) in its 48-bit message and assuming the bits are random and equiprobable under H0. Then the p-value is given by:
and there is always a chance among 2⁴⁸ that the code is correct. Therefore the performance at false positive rate below 1/2⁴⁸ is undefined, which is why the dashed gray line stops at this value. However note that although the bits of the DCTDWT key of Stable Diffusion were drawn randomly and equiprobably, the output of DCTDWT was not whitenened, contrary to our models and the original StableSignature paper. Therefore the assumption above does not actually hold in practice and the p-values for this system are indicative only. Using DCTDWT to claim whether a content is watermarked or not should be done with extreme care. BZH1, BZH2 and BZH3 correspond to three different levels of watermark power, with an average PSNR of 46.8dB (almost invisible), 43.8dB, and 41.9dB (slightly visible) respectively. They all compare well to their StableSignature counterpart, with the "extreme" model being very visible but on par with BZH1 in terms of performance. Note that without attacks, IMATAG's production watermark (lamark) operating at average PSNR 45.0dB is detected perfectly, so we don't show it in this graph.
This would suggest post-watermarking wins.. but what if we start altering the image before detection?
The graph above shows performance after the "combined" attack used in the StableSignature paper: brighten of 1.5, central crop 50% followed by JPEG compression at 80% quality. First of all, DCTDWT is not robust to this attack, showing performance that is worse than a random guess. The StableSignature-based approaches stay competitive with respect to post-watermarking, with BZH1 now on par with the weak model. This attack is still very easy for lamark which beats all other methods by a large margin. If we attack even stronger, with 1.5 brightening, 2x downscale, 50% crop and 50% JPEG compression we get this:
Now most watermarks are struggling a lot to resist this attack, with BZH2 and lamark being on par in terms of performance, and only BZH3 maintaining a decent level of performance, at the cost of being quite visible. Here's what the images look like:
Releasing the watermarker has the security risk that it allows anyone to watermark content. In the context of generative AI, it means anyone can take a real image and add a watermark to it, pretenting it is generated. In the particular case of Stable Signature this is done by encoding and decoding the image with the watermarked VAE. However, one can argue that the image has been processed and is therefore not authentic anymore. In any case, detecting the watermark simply means that the image has gone through this specific watermarking VAE decoder, to which anyone has access. Also, by watermarking many images with the same watermarker, one can learn a detector from it.
Releasing the full detector enables anyone to extract any watermark from any content. Even if the watermarker is kept private, this allows to just read the expected signal and retrain a new watermarker that will produce it. Also, giving access to the binary decision only still allows to train a proxy detector and perform adversarial attacks. By the way, these were called oracle attacks by the watermarking communauty long before the adversarial term became popular. Although some techniques exist to mitigate this issue, access to the detector decisions (e.g. via API) should always be controlled and limited.
Also, one aspect we did not discuss yet is the secret key. The mere fact that it is fixed makes the solution insecure. For a secure solution, one would need to have multiple keys, change it often enough and run the detection on all the keys used beforehand. Ideally the key should change after each use, like a one-time-password, but it makes detection untractable and less robust. Also, the keyed watermarker and detector should be kept secret.
Finally, for the particular case of Stable Signature, Stable diffusion’s original VAE was released publicly, making it easy to attack the watermark strongly (as was noted in the paper), or to extract the watermark and move it to another image.
So, where do we go from there?
Well, first of all we do not pretend the watermarking solution presented in our demo is secure. Also, we know it is not robust to advanced attacks (such as diffusion purification with the original VAE) or even some simple ones such as a flip (although one could simply run the detector twice in this case) or rotations. For more secure solution, robustness to specific attacks, advanced features such as payload support, etc.. one should still contact IMATAG.
However the good news is that since security is broken anyway for the demo watermarking system with a fixed key, maybe we can release more than just the watermarker. Indeed, as long as we can detect the demo watermark and this watermark only we do not weaken the security of the system for another secret key. In order to do so, we trained a Resnet-18 proxy classifier on the watermarked images, just as an adversarial attacker would do on a black box classification system. But we actually have white box access to our full detector, so we could use knowledge distillation instead to have a better detector. Since we don’t want to leak too much information on the teacher, we don’t want to mimic it when the image is not watermarked. So we ended up using a mix of a classification loss (on non-watermarked images) and knowledge distillation loss (on watermarked images) to train the resnet18 student.
This resulted in a binary classifier able to detect the demo watermark. Unlike the full detector we only have a binary decision, not a p-value estimate. In order to recalibrate the detector, we computed the logits on 1M images from the Flickr1M dataset and stored them. An approximate p-value is then obtained by simply counting the proportion of logits from these samples that are below the observed logit on the image to detect. Although a crude estimate, it provides a fast way to detect with moderate confidence and robustness most watermarked images.
We are quite happy that this makes the demo watermarking system self-contained for basic use cases and way more robust to unintentional attacks than previous public solutions. It is a step in the right direction to provide open watermarking systems for generative AI, and now our focus should be on improving their security as well!
To conclude, DCTDWT (invisible-watermark) is not good enough to resist even unintentional attacks. It is also hard to evaluate its false positive detection rate as the messages are not equiprobable. Methods based on StableSignature are competitive if one wants to benefit from its easy integration in diffusion model generation at zero additional computational cost, and the difficulty of removing it (one can't comment a line of code to do so). Combining it with BZH enabled a robust and self-contained watermarking solution. However, in both cases the security is hindered by the fixed (and public) key. Post watermarking with the corresponding watermarker allows for better control over distorsion, is less perceptible, and provides easier key management. Finally, IMATAG's production watermark, working with both real and generated images, is still the best if one wants to maximize both security and the perceptibility/robustness compromise.