Outpainting I - Controlnet version


There are at least three methods I know of to do outpainting, each with different variations and steps. This is the first one, using ControlNet; you can read about the other methods here:

Outpainting with ControlNet requires a mask, so this method only works when you can paint a white mask around the area you want to expand. With this method it isn't necessary to prepare the area beforehand, but it has the limitation that the image can only be as big as your VRAM allows.

1.- Original Image

For this case I'll use a wolf image that was provided by Laidawang in this comment:

2.- Outpaint

The starting prompt is "a wolf playing basketball" and I'll use the Juggernaut XL v9 model.

There's a ControlNet for SDXL trained for inpainting by destitech, named controlnet-inpaint-dreamer-sdxl. It's an early alpha version, but I think it works well most of the time.

import torch
from diffusers import ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "destitech/controlnet-inpaint-dreamer-sdxl", torch_dtype=torch.float16, variant="fp16"
)

This ControlNet model is really easy to use: you just need to paint the parts you want to replace white, so in this case I'm going to paint the transparent part of the image white.

To paint the alpha channel of an image white, I use this code:

import requests
from io import BytesIO
from PIL import Image

response = requests.get("https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/outpainting/313891870-adb6dc80-2e9e-420c-bac3-f93e6de8d06b.png?download=true")

control_image = Image.open(BytesIO(response.content))
new_controlnet_image = Image.new("RGBA", control_image.size, "WHITE")
new_controlnet_image.alpha_composite(control_image)

This image is 720x720 px. SDXL works better with 1024x1024 images, so the generation will be upscaled; you can use the smaller size, but you'll probably get a lower-quality image.

The conditioning scale affects how much of the original image is preserved. Since this is an outpaint, it's safe to use higher values; for inpainting and complex images it's better to use lower values, around 0.5.

pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9",
    torch_dtype=torch.float16,
    variant="fp16",
    controlnet=controlnet,
).to("cuda")

image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    guidance_scale=6.5,
    num_inference_steps=25,
    generator=generator,
    image=new_controlnet_image,
    controlnet_conditioning_scale=0.9,
    control_guidance_end=0.9,
).images[0]

With the same seed I get these results:

(Result images for controlnet_conditioning_scale values of 0.2, 0.5, 0.7, 0.9 and 1.0)

I always prefer to give the model a little freedom so it can adjust tiny details to make the image more coherent, so for this case I'll use 0.9.
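
If you want to reproduce this comparison, here's a minimal sketch that loops over the conditioning scale values with the same fixed seed. It assumes the pipeline, prompt, negative prompt and control image defined above; the seed and output filenames are just examples:

import torch

seed = 123456  # any fixed seed, so only the conditioning scale changes between runs
for scale in [0.2, 0.5, 0.7, 0.9, 1.0]:
    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipeline(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=1024,
        width=1024,
        guidance_scale=6.5,
        num_inference_steps=25,
        generator=generator,
        image=new_controlnet_image,
        controlnet_conditioning_scale=scale,
        control_guidance_end=0.9,
    ).images[0]
    result.save(f"outpaint_scale_{scale}.png")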

3.- Better outpaint with IP Adapter

At this point I think we're at the level of other solutions, but let's say we want the wolf to look just like the original image. For that, I want to give the model more context about the wolf and where I want it to be, so I'll use an IP Adapter.

There's a little trick that works for me: I use the generated image I want as a base, paint the mask of the wolf over it, and then use this as an attention mask for the IP Adapter.

The process I use to generate the mask is like this:

(Images: Base Image, Wolf Mask, Background)

Doing this with something like GIMP takes me less than a minute.
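
In diffusers, loading the IP Adapter Plus and turning that wolf mask into an attention mask looks like the following (these are the same calls used in the full script at the end of the post):

from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image

# Load the IP Adapter Plus (SDXL) into the existing pipeline.
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
    image_encoder_folder="models/image_encoder",
)
pipeline.set_ip_adapter_scale(0.4)

# Base image with the wolf and the hand-painted wolf mask.
ip_wolf_image = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/outpainting/ip_wolf_source.png?download=true"
)
ip_mask = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/outpainting/wolf_position_mask.png?download=true"
)

# Preprocess the mask so it can be used as an IP Adapter attention mask.
processor = IPAdapterMaskProcessor()
ip_masks = processor.preprocess(ip_mask, height=1024, width=1024)

The wolf image is then passed as ip_adapter_image and the processed mask as cross_attention_kwargs={"ip_adapter_masks": ip_masks} in the pipeline call, exactly as in the full script below.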

I'm using the IP Adapter Plus with a scale of 0.4, and with these settings we now get these results:

(Result images with the masked IP Adapter)

My guess as to why this works is that since we’re drawing a shape of a wolf and giving it an input image of a wolf, the model tries to maintain the shape, position, and coherence. In short, we’re giving the model more context for it to generate the final image, and now it looks a lot more like the original.

For example, if we don't use a mask with this IP Adapter, the results look like this:

(Result images without the IP Adapter mask)

4.- Better prompt

Now we're at the point where the image looks good, but I always want more, so let's improve it with the prompt.

I'll change the prompt to this:

"high quality photo of a wolf playing basketball, highly detailed, professional, dramatic ambient light, cinematic, dynamic background, focus"

I'll also give the ControlNet a little more freedom with control_guidance_end=0.9 so it can finish the details without restrictions.

(Result images with the improved prompt)

5.- Image-to-Image pass

Finally, if you've ever worked with compositing images or video, you'll know it's common practice to apply a filter to the whole composition to unify the final look. This is true for Stable Diffusion too, and it sometimes hides the seams better, if there are any. For this, I'll do a final pass over the whole image with an image-to-image pipeline.

Diffusers lets you switch pipelines with a single line of code while keeping the loaded models, which is what we need here; we just need to unload the ControlNet model.

pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline, controlnet=None)

VAE decoding is a lossy process, so every time we encode or decode we're losing detail and lowering the quality of the image. To prevent this, we need to stay in latent space as much as possible.

Diffusers allows this if you pass output_type="latent" to the pipeline. We then feed the latents to the image-to-image pipeline, but before that I also want to give it a more cinematic look, so I'll change the prompt again:

prompt = "cinematic film still of a wolf playing basketball, highly detailed, high budget hollywood movie, cinemascope, epic, gorgeous, film grain"

This should be just a quick pass, so I'll set the steps to 30 and the strength to 0.2.
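
Putting these two steps together, the final pass is just the image-to-image pipeline fed with the latents from the ControlNet pass. This is a sketch of that call; it assumes latents was produced by re-running the ControlNet pipeline with output_type="latent" as described above, and the guidance_scale of 3.0 matches the full script at the end:

# Final pass: a light image-to-image refinement fed directly with the latents
# returned by the controlnet pass (which was run with output_type="latent").
image = pipeline_img2img(
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=3.0,
    num_inference_steps=30,
    generator=generator,
    image=latents,
    strength=0.2,
    ip_adapter_image=ip_wolf_image,
    cross_attention_kwargs={"ip_adapter_masks": ip_masks},
).images[0]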

These are the final results:

(Final result images)

In my opinion these images are as good as or better than outpainting done with other UIs, and hopefully this will help people better understand what you can do with diffusers.

6.- Outpaint tip with IP Adapter

As a bonus, there's a neat little trick. Earlier I used the IP Adapter with a mask to feed more of the initial image into the generation, but it can also be used without a mask. This gives the model more context and makes it better at guessing the rest of the image, for example with just the prompt "high quality":

(Images: Original Image, Without IP Adapter, With IP Adapter)
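
In code, using the IP Adapter without the attention mask just means leaving out the cross_attention_kwargs argument. Here's a minimal sketch, reusing the pipeline, control image and wolf image from before:

# Same generation as before, but without the IP Adapter attention mask:
# the wolf image is still passed as ip_adapter_image, only the
# cross_attention_kwargs={"ip_adapter_masks": ...} argument is dropped.
image = pipeline(
    prompt="high quality",
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    guidance_scale=6.5,
    num_inference_steps=25,
    generator=generator,
    image=new_controlnet_image,
    controlnet_conditioning_scale=0.9,
    control_guidance_end=0.9,
    ip_adapter_image=ip_wolf_image,
).images[0]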

So if I do the same as before, but without the IP Adapter mask, this is the final result:

(Final result images without the IP Adapter mask)

Don't ask me why the Eiffel Tower is there ^^

The full code:

import random
from io import BytesIO

import requests
import torch
from PIL import Image

from diffusers import (
    AutoPipelineForImage2Image,
    ControlNetModel,
    DPMSolverMultistepScheduler,
    StableDiffusionXLControlNetPipeline,
)
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image, logging
from diffusers.utils.logging import set_verbosity


set_verbosity(logging.ERROR)  # to not show the cross_attention_kwargs warnings from AttnProcessor2_0

controlnet = ControlNetModel.from_pretrained(
    "destitech/controlnet-inpaint-dreamer-sdxl", torch_dtype=torch.float16, variant="fp16"
)

pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9",
    torch_dtype=torch.float16,
    variant="fp16",
    controlnet=controlnet,
).to("cuda")

pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
pipeline.scheduler.config.use_karras_sigmas = True

pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
    image_encoder_folder="models/image_encoder",
)
pipeline.set_ip_adapter_scale(0.4)

ip_wolf_image = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/outpainting/ip_wolf_source.png?download=true"
)

ip_mask = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/outpainting/wolf_position_mask.png?download=true"
)

processor = IPAdapterMaskProcessor()
ip_masks = processor.preprocess(ip_mask, height=1024, width=1024)

response = requests.get(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/outpainting/313891870-adb6dc80-2e9e-420c-bac3-f93e6de8d06b.png?download=true"
)
control_image = Image.open(BytesIO(response.content))
new_controlnet_image = Image.new("RGBA", control_image.size, "WHITE")
new_controlnet_image.alpha_composite(control_image)

prompt = "high quality photo of a wolf playing basketball, highly detailed, professional, dramatic ambient light, cinematic, dynamic background, focus"
negative_prompt = ""

seed = random.randint(0, 2**32 - 1)
generator = torch.Generator(device="cpu").manual_seed(seed)

latents = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    guidance_scale=6.5,
    num_inference_steps=25,
    generator=generator,
    image=new_controlnet_image,
    controlnet_conditioning_scale=0.9,
    control_guidance_end=0.9,
    ip_adapter_image=ip_wolf_image,
    cross_attention_kwargs={"ip_adapter_masks": ip_masks},
    output_type="latent",
).images[0]

pipeline_img2img = AutoPipelineForImage2Image.from_pipe(pipeline, controlnet=None)

prompt = "cinematic film still of a wolf playing basketball, highly detailed, high budget hollywood movie, cinemascope, epic, gorgeous, film grain"

image = pipeline_img2img(
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=3.0,
    num_inference_steps=30,
    generator=generator,
    image=latents,
    strength=0.2,
    ip_adapter_image=ip_wolf_image,
    cross_attention_kwargs={"ip_adapter_masks": ip_masks},
).images[0]

image.save("result.png")