CLIP Needs Registers. And Gated MLPs. And +20M params.
Fixing CLIP's modality gap via happy little accidents.
- ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗
I just want a new Text Encoder...
- ...for my Text-to-Image (Text-to-Video) AI! \o/
- I recommend this one, the 'sweet spot' ckpt12: 👉 direct download 👈
- An even lower modality gap (text embeddings 'more alike' to image embeddings, but less accurate): direct download
- Enjoy! (You don't need to do anything else; they're just normal CLIP Text Encoders! See the loading sketch below.)
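If you'd rather wire it up in code than drop the file into your UI, here's a minimal loading sketch using HuggingFace Transformers. The path is a placeholder for wherever you saved the download, and it assumes the checkpoint is laid out as a Transformers-format folder; for a bare .safetensors state dict, load it with safetensors and `load_state_dict` instead:

```python
# Minimal sketch (the path below is a placeholder, NOT the actual repo/file name).
# The Text Encoder is a standard CLIP-L text tower, so anything that expects a
# CLIP ViT-L/14 text encoder (e.g. SDXL or Flux pipelines) can use it.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

text_encoder = CLIPTextModel.from_pretrained("path/to/downloaded-text-encoder")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["a photo of a cat"], padding="max_length", return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens)
print(out.last_hidden_state.shape)  # torch.Size([1, 77, 768]) for a CLIP-L text tower
```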
⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️
- The ViT (Vision Encoder) is basically a big mutant. Alas:
- The full model .safetensors have the 'import clip' (OpenAI) structure inside.
- It's just so you don't need to load any 'danger pickles'. :)
- For now, it runs with 'import clip' code (I'm working on an HF implementation, though!).
- In the meantime, I made an entire playground for the CLIP models (+ safetensors loading; a safe-loading sketch follows below)! 🎉:
- 🌟 https://github.com/zer0int/CLIP-fine-tune-registers-gated ✨
- All code for fine-tuning it yourself is also included on my Git! 🤗
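And if you just want to peek inside the full-model weights safely (no pickles involved), a tiny sketch; "model.safetensors" is a placeholder file name, and actually instantiating the mutant ViT requires the model code from the GitHub repo above:

```python
# Minimal sketch: safe inspection of the full-model weights via safetensors
# (no pickle deserialization). "model.safetensors" is a placeholder file name.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors", device="cpu")
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))  # weight names + shapes, first 10 entries
```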
Wait, but what is this?!
- The Vision Transformer has +4 tokens (Register Tokens).
- ...and gated ReLU MLPs inside each layer, plus a final Fusion MLP (see the sketch after this list).
- +20M parameters (~430M -> ~450M total).
- It's now a CLIP with an extremely low modality gap.
- See the table below for details.
- And if you want to know more about modality gaps & all the details, please check out the GitHub!
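If you like reading code more than prose, here's a minimal PyTorch sketch of the two ideas (register tokens + a gated ReLU MLP). This is an illustrative reading, not the repo's actual implementation; the names, dims, and exact gating variant are my assumptions, so check the GitHub for the real thing:

```python
import torch
import torch.nn as nn

class GatedReLUMLP(nn.Module):
    """Illustrative gated MLP: a value path elementwise-modulated by a ReLU gate
    (a ReGLU-style variant; the actual gating in the repo may differ)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.value(x) * torch.relu(self.gate(x)))

class RegisterTokens(nn.Module):
    """Appends 4 learnable register tokens to the patch sequence, as in
    'Vision Transformers Need Registers' (Darcet et al., 2023)."""
    def __init__(self, d_model: int = 1024, num_registers: int = 4):
        super().__init__()
        self.registers = nn.Parameter(torch.zeros(1, num_registers, d_model))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [batch, 1 (CLS) + num_patches, d_model]
        regs = self.registers.expand(patch_tokens.shape[0], -1, -1)
        return torch.cat([patch_tokens, regs], dim=1)

# Toy usage: ViT-L/14 at 224px -> 256 patches + CLS = 257 tokens; +4 registers = 261.
x = torch.randn(2, 257, 1024)
x = RegisterTokens()(x)
x = x + GatedReLUMLP(1024, 4096)(x)  # residual MLP step inside a transformer block
print(x.shape)  # torch.Size([2, 261, 1024])
```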
An image is worth 16x16 words, alas:
[Image: Attention heatmap, pre-trained OpenAI CLIP ViT-L/14]
[Image: Text-to-Image examples, Flux.1-dev, pure CLIP (no T5) guidance]
Model Performance Overview
| Task / Dataset | Metric | ViT-L/14 OpenAI (Pre-trained) | X-GATED (ckpt20 xtreme) | X-GATED (ckpt12 balanced) | X-GATED (ckpt12 balanced, ablated) |
|---|---|---|---|---|---|
| VOC-2007 (Multilabel) | mAP | 0.7615 | 0.8140 | **0.8471** | 0.8247 |
| MSCOCO Retrieval | Image Recall@5 | 0.2194 | **0.3565** | 0.3532 | 0.3349 |
| MSCOCO Retrieval | Text Recall@5 | 0.3034 | **0.5425** | 0.5278 | 0.5086 |
| Linear Probe CIFAR-10 | Acc@1 | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| Linear Probe CIFAR-10 | Acc@5 | 0.9966 | **0.9997** | **0.9997** | **0.9997** |
| Linear Probe CIFAR-10 | Mean Class Recall | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| MVT ImageNet/ObjectNet (Zero-Shot) | Accuracy | 0.8453 | 0.8686 | **0.8830** | 0.8815 |
| Linear Probe ILSVRC2012 | Top-1 | **69.86%** | 66.43% | 67.10% | 68.99% |
| Linear Probe ILSVRC2012 | Top-5 | **92.70%** | 91.52% | 91.83% | 92.64% |
| Modality Gap Metrics | Euclidean Gap ↓ | 0.8276 | **0.4740** | 0.5395 | 0.7486 |
| Modality Gap Metrics | JSD ↓ | 0.5200 | 0.1601 | **0.1303** | 0.3310 |
| Modality Gap Metrics | Wasserstein Distance ↓ | 0.4084 | **0.1742** | 0.2102 | 0.3262 |
| Modality Gap Metrics | Img-Text Cos Sim (mean) ↑ | 0.2723 | **0.4926** | 0.4794 | 0.3634 |
| Modality Gap Metrics | Img-Text Cos Sim (std) | 0.0362 | 0.0814 | 0.0758 | 0.0537 |
| Modality Gap Metrics | Text-Text Cos Sim (mean) | 0.6807 | 0.6657 | 0.6896 | 0.6896 |
| Modality Gap Metrics | Text-Text Cos Sim (std) | 0.1344 | 0.1671 | 0.1535 | 0.1535 |
Bolded values represent the best performance for each metric (ties are bolded together; rows without a clear 'better' direction are left unbolded).
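In case you're wondering how the gap numbers are measured: here's a minimal sketch of the Euclidean modality gap, i.e. the L2 distance between the centroids of the normalized image and text embeddings, in the spirit of Liang et al.'s 'Mind the Gap' (2022). This is my assumption about the exact recipe; the actual evaluation code is on the GitHub:

```python
import torch
import torch.nn.functional as F

def euclidean_modality_gap(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> float:
    """L2 distance between the mean (L2-normalized) image and text embeddings."""
    img = F.normalize(image_embeds, dim=-1).mean(dim=0)
    txt = F.normalize(text_embeds, dim=-1).mean(dim=0)
    return (img - txt).norm().item()

# Toy usage with random features (real inputs: CLIP image/text embeddings, [N, 768]):
print(euclidean_modality_gap(torch.randn(100, 768), torch.randn(100, 768)))
```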