Gemma 12B QAT as text encoder for Anima 2.0?

#208
by Downtown-Case - opened

Call this a left-field suggestion, but:

  • It's much smarter than Qwen 4B.

  • ...But only 7GB in its full precision.

  • It was natively trained for image and audio inputs, not as an addon mmproj like other LLMs.

If it was incorporated into a future Anima, that would give it the uniquely strong ability to "prompt" the text encoder with an image. Or audio!

Even its uncensored version couldn't output accurate NSFW content in Image2Text, but Qwen could.

Gemma is actually perfectly capable of generating accurate NSFW text descriptions for images without even needing an uncensored version; the stock model can achieve this with a simple prompt jailbreak. It's possible you didn't explicitly instruct the model to provide precise NSFW details, which is why it fell short. That said, the 12B model does suffer from rather severe visual hallucinations in certain scenarios.

Like I explained in another discussion here, the most likely base model for the next Anima would be Cosmos 3. That uses a self-contained Qwen3 text encoder, which happens to be the same architecture as the text encoder that Anima 1.0 uses. However, the text encoder is already the same size as the diffusion model itself, so honestly I don't think it's worth swapping. In fact, it would be more worthwhile to pull the text encoder out of the model itself, so you only store one text encoder instead of downloading an extra 16GB with every checkpoint (cough cough SDXL CLIP... at least it's smaller!)
