Virtual Try-On using IP-Adapter Inpainting

Community Article Published March 29, 2024


Virtual Try-On

When it comes to AI and fashion, 'Virtual Try-On' is one of the hottest, most sought-after tools. In this blog we're going to build our own Virtual Try-On tool. There are already a few pretty good demos in 🤗 Spaces:

OOTDiffusion


OOTDiffusion's open source code is posted on GitHub.

Outfit Anyone


Unfortunately, the diffusion model is not provided in their GitHub repo. Also, it seems like you can only use their person images, because it errored out when I tried to use mine.

IP-Adapter

We're going to build a Virtual Try-On tool using IP-Adapter! What is an IP-Adapter? To put it simply, IP-Adapter is an image prompt adapter that plugs into a diffusion pipeline.

For Virtual Try-On, we'd naturally gravitate towards Inpainting. We paint (or mask) the clothes in an image, then write a prompt to change the clothes to something else.


The problem here is that in standard Inpainting we can only use text to change the clothes. E.g. "pink dress". But I don't want any pink dress, I want a photo of a specific pink dress. That's where IP-Adapter comes in. Instead of "pink dress", I give the model an image prompt of a specific pink dress.


I would encourage reading the IP-Adapter paper and checking out the GitHub repo. Also, 🤗 has its own IP-Adapter documentation which explains it super well.

Implementation

Hugging Face Spaces

You can either jump straight into the Colab Notebook or read through the rest of the blog and then go through the notebook at the end. You'll need a GPU to run this code.

Install Libraries

pip install diffusers accelerate transformers

Import Libraries

from diffusers import AutoPipelineForInpainting, AutoencoderKL
from diffusers.utils import load_image
import torch
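
Since the pipeline needs a GPU, it's worth confirming that PyTorch can actually see one before loading anything. A quick sanity check (not part of the original notebook):

# Stop early with a clear message if no CUDA GPU is available
assert torch.cuda.is_available(), "No CUDA GPU found - this pipeline needs a GPU"
print(torch.cuda.get_device_name(0))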

Load the Pipeline

This VAE is patched to run in float16, which reduces the amount of VRAM needed to run the model:

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

Let's load the SDXL 1.0 Inpainting model into a pipeline:

pipeline = AutoPipelineForInpainting.from_pretrained("diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
                                                     vae=vae,
                                                     torch_dtype=torch.float16,
                                                     variant="fp16",
                                                     use_safetensors=True
                                                    ).to("cuda")

This is where we load the IP-Adapter. This adapter connects to our inpainting pipeline and allows us to give the pipeline an image prompt (or IP):

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin", low_cpu_mem_usage=True)

Load Images

First we'll load our main image with the person in it. I'm using 1024x1024 JPGs. load_image() loads an image from a local path or a public URL and returns a PIL image. It is suggested to use a full body photo of the person wearing as little clothing as possible, since existing clothing may influence the final results.

image = load_image('https://cdn-uploads.huggingface.co/production/uploads/648a824a8ca6cf9857d1349c/jpFBKqYB3BtAW26jCGJKL.jpeg').convert("RGB")

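If your photo isn't already 1024x1024, you can resize it right after loading; the same goes for the clothing image below (a small convenience step, not in the original code):

# Resize the person image to the resolution used throughout this post
image = image.resize((1024, 1024))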

Next we'll load our clothing image. I highly suggest using a clothing image that doesn't have a person in it, like the example below. We won't use it here, but I have a very simple clothing segmentation tool published on GitHub and Hugging Face.

ip_image = load_image('https://cdn-uploads.huggingface.co/production/uploads/648a824a8ca6cf9857d1349c/NL6mAYJTuylw373ae3g-Z.jpeg').convert("RGB")


Finally, we'll need to create a mask for the inpainting pipeline. There are a number of ways to do this: we can draw the mask by hand or (since we're all AI enthusiasts here) use an image segmentation model! The main benefit of using an image segmentation model is that it generates the mask automatically, so all we need to do is provide the person image and the clothing image and the code takes care of the rest.

Another thing we need to consider is which part of the person we're masking. The torso? The arms? The legs? The whole body? To answer this we would need to know what type of clothing is being diffused into the image. For example, a sleeveless mini dress only covers some of the torso and some of the legs, whereas a long-sleeve gown covers almost the entire body. To account for most types of clothing, I propose masking the entire body. It is important not to mask the face, so that the face in the final image looks like the face in the original image.
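
If you'd rather draw the mask yourself, the idea is simply a black image with white pixels wherever the pipeline is allowed to repaint. A minimal sketch with PIL (the rectangle coordinates are made up and would need tweaking for your photo):

from PIL import Image, ImageDraw

# Start with an all-black mask the same size as the person image
manual_mask = Image.new("L", image.size, 0)
draw = ImageDraw.Draw(manual_mask)
# Paint the area to repaint in white -- here a rough box over the body, below the face
draw.rectangle([256, 300, 768, 1024], fill=255)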

To do the body segmentation, let's use my segment body module which can generate a body mask with 1 import statement and 1 line of code:

from SegBody import segment_body
seg_image, mask_image = segment_body(image, face=False)


Generate Image

We need to set the IP-Adapter scale with set_ip_adapter_scale(). This method controls the amount of text or image conditioning applied to the model. A value of 1.0 means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. For Virtual Try-On we want to make sure the clothing in the final image looks like the IP image, so we'll use a value of 1.0:

pipeline.set_ip_adapter_scale(1.0)

Now that we have our pipeline and images loaded we can finally generate an image. I was able to get pretty good results with these parameters but feel free to play around with them and read about them here.

final_image = pipeline(
    prompt="photorealistic, perfect body, beautiful skin, realistic skin, natural skin",
    negative_prompt="ugly, bad quality, bad anatomy, deformed body, deformed hands, deformed feet, deformed face, deformed clothing, deformed skin, bad skin, leggings, tights, stockings",
    image=image,
    mask_image=mask_image,
    ip_adapter_image=ip_image,
    strength=0.99,
    guidance_scale=7.5,
    num_inference_steps=100,
).images[0]
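
final_image is a regular PIL image, so you can save it or display it however you like (a small extra step, not in the original code):

# Save the generated try-on image to disk
final_image.save("virtual_try_on.png")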


Not bad! We managed to build our own Virtual Try-On with 12 lines of Python code. Pretty awesome right?

Virtual Try-On Function

Let's put it all together in a function:

def virtual_try_on(img, clothing, prompt, negative_prompt, ip_scale=1.0, strength=0.99, guidance_scale=7.5, steps=100):
    _, mask_img = segment_body(img, face=False)
    pipeline.set_ip_adapter_scale(ip_scale)
    images = pipeline(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=img,
        mask_image=mask_img,
        ip_adapter_image=clothing,
        strength=strength,
        guidance_scale=guidance_scale,
        num_inference_steps=steps,
    ).images
    return images[0]

And this is all we need to call the virtual try-on function:

result = virtual_try_on(img=image, 
                        clothing=ip_image,
                        prompt="photorealistic, perfect body, beautiful skin, realistic skin, natural skin",
                        negative_prompt="ugly, bad quality, bad anatomy, deformed body, deformed hands, deformed feet, deformed face, deformed clothing, deformed skin, bad skin, leggings, tights, stockings")

Acknowledgements

Huge shoutout to the amazing 🤗 Diffusers team and the original IP-Adapter authors Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang.

About Me

Hello, my name is Tony Assi. I'm a designer based in Los Angeles. I have a background in software, fashion, and marketing. I currently work for an e-commerce fashion brand. Check out my 🤗 profile for more apps, models and datasets.

Feel free to send me an email at tony.assi.media@gmail.com with any questions, comments, business inquiries or job offers.