AssertionError: You do not have CLIP state dict!

#2
by PixelClassisist - opened

I get the following error when trying to use this in Forge. Your TEXT-detail-improved hiT model works fine, though. Any ideas?

Owner

Could you specify what you mean by "this" - which model exactly is not working for you? Make sure you use the same variant that worked with the hiT model; e.g. if you used the Text-Encoder-only version (with "TE-only" in the filename) of the hiT model, then also try the TE-only version of the model you're referring to.

Thanks for the reply. I'm referring to "Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors". Currently I'm using "ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF" and this one works fine, however, I often use very long prompts, so I thought the Long version might be better suited. In the files and versions tabs of "Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors" I can't see a TE-only option. Am I missing something perhaps?

Owner

Oh, I am sorry about my confusion!
I just clicked this in my inbox and failed to see we're discussing Long-CLIP, not "normal" CLIP. Sorry about that!

You need to adjust (expand) the position embeddings and "inject" the Long-CLIP model for that to work (see the sketch below).
https://github.com/SeaArtLab/ComfyUI-Long-CLIP did this for SD and SDXL, while I contributed the Flux node via a pull request.
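
In case it helps to see what "expanding and injecting" means concretely, here is a minimal sketch of the idea in Python (not the ComfyUI node itself), assuming an HF-style transformers checkpoint layout; the local file path and the key-prefix filtering are assumptions, so check the missing/unexpected key lists after loading:

```python
from safetensors.torch import load_file
from transformers import CLIPTextConfig, CLIPTextModel

LONG_CLIP_PATH = "Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors"  # local path, adjust as needed

# ViT-L/14 text-encoder dimensions, but with 248 positions instead of CLIP's usual 77.
config = CLIPTextConfig(
    vocab_size=49408,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=12,
    num_attention_heads=12,
    projection_dim=768,
    max_position_embeddings=248,  # Long-CLIP's extended token window
)
text_encoder = CLIPTextModel(config)

# Key names differ between exports (OpenAI-style vs. HF-style), so this prefix
# filter is an assumption about the checkpoint layout; load non-strictly and
# inspect what didn't match.
state_dict = load_file(LONG_CLIP_PATH)
text_sd = {k: v for k, v in state_dict.items() if k.startswith("text_model.")}
missing, unexpected = text_encoder.load_state_dict(text_sd, strict=False)
print("missing:", len(missing), "unexpected:", len(unexpected))
```

The essential part is max_position_embeddings=248: that is exactly the [248, 768] position-embedding table that trips up loaders which hard-code CLIP's 77-token limit, and the tokenizer's maximum length has to be raised to match.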

Unfortunately, I don't use Forge (or do much inference at all; my art became tweaking the model itself, not so much generating images, haha!). But I hope the ComfyUI details will serve as guidance for what you'd need to implement in Forge, or for requesting the implementation from the Forge authors / the community.

Hope that helps / is a starting point, at least!

Any solution for Forge? All the models fail with "ValueError: Failed to recognize model type!"

UPD1: having "CLIP" in the filename helps Forge load the regular CLIP model, but Long-CLIP still throws an error:
RuntimeError: Error(s) in loading state_dict for IntegratedCLIP: size mismatch for transformer.text_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([248, 768]) from checkpoint, the shape in current model is torch.Size([77, 768]).

UPD2: Editing this line to 248 helped!
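
For anyone else hitting the same size mismatch: 77 vs. 248 is the length of the text encoder's position-embedding table, so the edit boils down to letting whatever builds IntegratedCLIP allocate 248 positions. The exact Forge file and line aren't quoted here; the snippet below is only a hypothetical sketch of a more general fix, reading the length from the checkpoint so regular CLIP (77) and Long-CLIP (248) both load. The key-name filter assumes an HF-style layout (OpenAI-style checkpoints call it "positional_embedding" instead):

```python
from safetensors.torch import load_file

# Hypothetical sketch, not Forge's actual code: infer the position-embedding
# length from the checkpoint instead of hard-coding 77 (or 248).
sd = load_file("Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors")
pos_keys = [k for k in sd
            if "text_model" in k and k.endswith("position_embedding.weight")]
max_positions = sd[pos_keys[0]].shape[0] if pos_keys else 77
print(max_positions)  # 248 for this Long-CLIP checkpoint, 77 for a regular CLIP-L
```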

Owner

@ceoofcapybaras - glad you figured it out already! In the long term, I guess opening an issue on the Forge repo / asking for Long-CLIP to be implemented there would be the best option, so it's available to everybody (and not just to those willing to poke around and edit the code).
