Using SDXL IP-Adapters with SDXL-Turbo? Not working?

#21
by Zubi401 - opened

I'm trying to use IP-Adapters with SDXL-Turbo (both appear to use the SDXL 1.0 checkpoint as their base). I'm doing the following:

import torch
from diffusers import AutoencoderTiny, DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)

adapter_id = "ip-adapter-plus-face_sdxl_vit-h.bin"
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=adapter_id)

This gives me the following error:

KeyError                                  Traceback (most recent call last)
Cell In[14], line 2
      1 adapter_id = "ip-adapter-plus-face_sdxl_vit-h.bin"
----> 2 pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=adapter_id)

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/diffusers/loaders/ip_adapter.py:152, in IPAdapterMixin.load_ip_adapter(self, pretrained_model_name_or_path_or_dict, subfolder, weight_name, **kwargs)
    149     self.feature_extractor = CLIPImageProcessor()
    151 # load ip-adapter into unet
--> 152 self.unet._load_ip_adapter_weights(state_dict)

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/diffusers/loaders/unet.py:711, in UNet2DConditionLoadersMixin._load_ip_adapter_weights(self, state_dict)
    708 self.set_attn_processor(attn_procs)
    710 # create image projection layers.
--> 711 clip_embeddings_dim = state_dict["image_proj"]["proj.weight"].shape[-1]
    712 cross_attention_dim = state_dict["image_proj"]["proj.weight"].shape[0] // 4
    714 image_projection = ImageProjection(
    715     cross_attention_dim=cross_attention_dim, image_embed_dim=clip_embeddings_dim, num_image_text_embeds=4
    716 )

KeyError: 'proj.weight'

I've also tried setting the image encoder to the one mentioned in the IP-Adapter repo, but it still gives the same error. Any idea how to fix this? Or is SDXL-Turbo not supported yet?

e.g.

from transformers import CLIPVisionModelWithProjection

# I've also tried the non-large version, but I still get the same result...
pipe.image_encoder = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k", ignore_mismatched_sizes=True)
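For what it's worth, the KeyError seems to come from the checkpoint itself: the "plus" adapters use a Resampler image-projection module whose state dict has no "proj.weight" key, which older diffusers releases cannot load (support for the plus variants was added in a later release, 0.25.0 if I remember right). A quick sketch to confirm, assuming huggingface_hub is installed:

```python
import torch
from huggingface_hub import hf_hub_download

# Download just the adapter weights and inspect the image_proj keys.
path = hf_hub_download("h94/IP-Adapter", "sdxl_models/ip-adapter-plus-face_sdxl_vit-h.bin")
state_dict = torch.load(path, map_location="cpu")

# Plain IP-Adapters have a "proj.weight" key here; the "plus" variants instead
# carry Resampler keys ("latents", "proj_in.weight", "proj_out.weight", ...),
# which is exactly why the load fails with KeyError: 'proj.weight'.
print(sorted(state_dict["image_proj"].keys())[:5])
```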

Use the CLIP image encoder from the models folder; it is ViT-H: https://huggingface.co/h94/IP-Adapter/tree/main/models/image_encoder . It should work with all the ViT-H SDXL adapters. You need to do the same in ComfyUI as well.
