I can't really make sense of why some models work and some models just won't work at all.

#5
by kellempxt - opened

screencapture-127-0-0-1-8188-2024-11-17-05_19_15.png

This got me chuckling because I have no idea why even the original Long-CLIP L loves to change my cat into an Asian girl.

screencapture-127-0-0-1-8188-2024-11-17-10_28_57.png

So I tried another one... with SDXL PONY, it's usually hit and miss.

But I'm adding more images here because I wonder what people generally obtain from their prompts... good images or... weird images...

screencapture-127-0-0-1-8188-2024-11-17-10_45_44.png

An example where, I dunno, this just won't work.

And subsequently, this one does work okay.

screencapture-127-0-0-1-8188-2024-11-17-10_53_02.png

screencapture-127-0-0-1-8188-2024-11-17-10_55_07.png
screencapture-127-0-0-1-8188-2024-11-17-10_54_39.png
screencapture-127-0-0-1-8188-2024-11-17-10_54_11.png

Successive minor variations... and then this...

screencapture-127-0-0-1-8188-2024-11-17-10_55_27.png

screencapture-127-0-0-1-8188-2024-11-17-10_55_49.png

When using extremely high CFG values... I get something like this...

screencapture-127-0-0-1-8188-2024-11-17-10_56_27.png

So I add Skimmed CFG as a workaround to avoid that oversaturation. Variations below.

screencapture-127-0-0-1-8188-2024-11-17-10_57_25.png
screencapture-127-0-0-1-8188-2024-11-17-10_56_58.png
screencapture-127-0-0-1-8188-2024-11-17-10_56_44.png

Well, "kitten on table" + "beautiful". "Beautiful" is weakly associated with humans, in CLIP. Though "adorable" + "kitten on table" would likely be worse, as adorable is even more strongly associated with humans.
"Kitten" is close to "kitty" which is related to "Hello Kitty" which is a Japanese brand which is associated with "Asian" which is again associated with Manga and Hentai.

Unfortunately, putting "human" + "face" + "person" in the negative prompt will shift the entire meaning of the embeddings dramatically, as it's such a strong (heavily trained-on) concept.

If you call it "magnificent", "mesmerizing", or "alluring" instead of "beautiful", you might have a chance of steering away from "human".
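
If you want to sanity-check those word associations yourself, here's a rough sketch using the transformers library (the model name, prompt phrasings, and anchor captions are just my picks, not anything from your workflow): it measures how close each adjective + "kitten on a table" prompt lands to "a photo of a person" vs. "a photo of a kitten" in CLIP-L's text embedding space.

```python
# Rough sketch: compare CLIP text embeddings of adjective + "kitten on a table"
# prompts against "a photo of a person" vs. "a photo of a kitten".
# Model choice and phrasings are assumptions; adjust to whatever CLIP you use.
import torch
from transformers import CLIPModel, CLIPTokenizer

model_name = "openai/clip-vit-large-patch14"  # CLIP ViT-L/14 ("CLIP-L")
tokenizer = CLIPTokenizer.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name).eval()

adjectives = ["beautiful", "adorable", "magnificent", "mesmerizing", "alluring"]
prompts = [f"a {adj} kitten on a table" for adj in adjectives]
anchors = ["a photo of a person", "a photo of a kitten"]

with torch.no_grad():
    inputs = tokenizer(prompts + anchors, padding=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # normalize for cosine similarity

prompt_feats, anchor_feats = feats[: len(prompts)], feats[len(prompts):]
sims = prompt_feats @ anchor_feats.T  # shape: (num_prompts, 2)

for adj, (sim_person, sim_kitten) in zip(adjectives, sims.tolist()):
    print(f"{adj:12s}  person: {sim_person:.3f}  kitten: {sim_kitten:.3f}")
```

If "beautiful" and "adorable" score noticeably higher on the "person" column than the other adjectives, that's consistent with the bias described above. Keep in mind that raw text-to-text cosine similarity is only a crude proxy for what actually happens inside a full SDXL pipeline.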

But yeah, AI has a weird way of learning patterns and relatedness. I don't even know what "bigaspV1" was exactly trained on or is biased towards; even CLIP's AI weirdness alone is intriguing.

Did you know? When CLIP 'looks at' a cat with reflecting retinas, an "angel neuron" fires in the CLIP model, because that neuron is multimodal and seems to encode all things "indirect light, reflection, glow, halos, holy, bright, albedo of the moon".
That's just one example of the "alien / AI-lien" way AI learns things, which can lead to outcomes like the ones you're seeing.

The PONY images might be, well, PONY-specific, but if you look on Reddit, there have been reports of a new PyTorch version messing up Flux.1 generations in the same way. I haven't followed up on that, and I don't know what's up with PONY in particular, but such glitches can also result from PyTorch version* differences and how they interact with the models, cross-attention / xformers, and whatnot. So it's always good to ask yourself "did I update Python packages recently?" when such glitches suddenly appear in a previously working configuration.
*Edit: That was a few months ago; I'm not sure if it's still current.
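
A quick way to answer the "did I update Python packages recently?" question is to dump the relevant versions from the exact environment ComfyUI runs in. A minimal sketch (xformers may not be installed at all, hence the try/except):

```python
# Minimal sketch: print the package versions most likely to matter for such glitches.
# Run this with the same Python environment that ComfyUI uses.
import sys
import torch

print("python  :", sys.version.split()[0])
print("torch   :", torch.__version__)
print("cuda    :", torch.version.cuda)

try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers: not installed")
```

Comparing that output against a snapshot from when generations still looked fine makes it easy to spot an accidental upgrade.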

With PONY, I'd say "try a different sampler than Euler, and try various CFG scales", but you've already implemented a "CFG hack".

Plus, you're using a large variety of arbitrary SDXL checkpoints, so those might very well be to blame too. That's a complicated situation where only trial and error can help you. I hope some of these tips were useful, though. Good luck! :)

PS: In case you want to learn more about what and how CLIP "sees", you can feed it an image, get CLIP's "opinion", and then do model inversion to reveal the image that basically formed in CLIP's "brain" while the AI was processing the input:
https://github.com/zer0int/CLIPInversion
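
The repo covers the actual inversion step. Just as an illustration of the "feed it an image and get CLIP's opinion" half, here's a rough sketch that scores a few candidate captions against an image (the file name, captions, and model choice are placeholders I made up):

```python
# Rough sketch: score candidate captions against an image with CLIP.
# The image path and captions below are placeholders; swap in your own.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-large-patch14"
processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name).eval()

image = Image.open("my_generation.png")  # hypothetical file name
captions = [
    "a kitten sitting on a table",
    "a portrait of an asian girl",
    "a cat with glowing eyes",
]

with torch.no_grad():
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```

The caption with the highest probability is a crude stand-in for "CLIP's opinion" of the image; the inversion tooling in the linked repo is what turns that kind of signal back into a picture.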
