Error when passing the "padding_mask_crop" parameter in StableDiffusionInpaintPipeline

#43
by mayonaisu - opened
import PIL.Image
import requests
import torch
from io import BytesIO

from diffusers import StableDiffusionInpaintPipeline


def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
)
pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(
    prompt=prompt, image=init_image, mask_image=mask_image, padding_mask_crop=4
).images[0]

When I pass padding_mask_crop with an int value, I get the following error:

ValueError: The UNet should have 4 input channels for inpainting mask crop, but has 9 input channels.

The docs for stable-diffusion-inpainting mention that:

For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint.

So, can this parameter even be used with this pipeline? If so, how do I fix this?

And what are appropriate values for this parameter?

It wants you to use a non-inpainting checkpoint (e.g. "runwayml/stable-diffusion-v1-5", whose UNet has 4 input channels, vs. 9 for an inpainting-specific checkpoint). Appropriate values might be 32, 64, 96, or 128 pixels.

It's meant to do the same thing as "Inpaint only masked" in the AUTOMATIC1111 web UI. Basically, it crops your input image and mask to the minimum rectangle that can contain your mask plus the padding_mask_crop padding, upscales that crop to the original image size, does the inpainting at full resolution (which in this case is more like img2img, I think, hence wanting a normal checkpoint? I'm unsure on that though), and then at the end scales the result back down and merges it into the original image. This lets you touch up faces/hands/details at higher resolution, which leads to better results.

See this PR for more info (as well as for the associated mask_blur feature): https://github.com/huggingface/diffusers/pull/6072 - Hope this helps!
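
For example, here's a minimal sketch of the fix, reusing init_image, mask_image, and the prompt from the script above (the checkpoint and padding value are just the suggestions from this comment, not the only options):

import torch

from diffusers import StableDiffusionInpaintPipeline

# A non-inpainting checkpoint: its UNet takes 4 input channels,
# which is what the padding_mask_crop code path expects.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.to("cuda")

# padding_mask_crop is in pixels; 32, 64, 96, or 128 are reasonable starting points.
image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    padding_mask_crop=64,
).images[0]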

[image: without padding_mask_crop (left: inpainted, right: original)]
[image: with padding_mask_crop=128 (left: inpainted, right: original)]
There is improvement in the face and the overall image quality is retained, but there is a faded tinge of the mask appearing on the inpainted face. Not sure why. Also, is it a better idea to resize the original image and mask to (512, 512), since this model generates at that resolution by default?

The tinge of the mask appears because this method of inpainting fully preserves the unmasked area, whereas normal inpainting slightly alters the areas around the mask so they blend better. The mask_blur feature can help overcome this; try setting it to 25.
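
Here's a minimal sketch of that, assuming the blur helper on the pipeline's mask processor (added alongside padding_mask_crop in the PR linked above):

# Soften the mask edges so the inpainted crop blends into its surroundings,
# analogous to "Mask blur" in the AUTOMATIC1111 web UI.
blurred_mask = pipe.mask_processor.blur(mask_image, blur_factor=25)

image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=blurred_mask,
    padding_mask_crop=128,
).images[0]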

As for resizing, most models can handle some deviation from their training resolution before they have issues. Unless you're seeing artifacts like duplication or torso stacking, or you're hitting hardware limitations, you shouldn't need to worry about it.
