CLIP Needs Registers. And Gated MLPs. And +20M params.

I just want a new Text Encoder...

...for my Text-to-Image (Text-to-Video) AI! \o/
I recommend this one, the 'sweet spot' ckpt12: 👉 direct download 👈
Even lower modality gap (text 'more alike' to image, but less accurate): direct download
Enjoy! (You don't need to do anything else, they're just normal CLIP Text Encoders!)

The ViT (Vision Encoder) is basically a big mutant. Alas:
The full model .safetensors have the 'import clip' (OpenAI) structure inside.
It's just so you don't need to load any 'danger pickles'. :)
Alas, currently it runs with 'import clip' code (I'm working on a HF implementation, though!).
However, for now, I made an entire playground for the CLIP models (+ safetensors loading)! 🎉:
🌟 https://github.com/zer0int/CLIP-fine-tune-registers-gated ✨
All code for fine-tuning it yourself is also included on my Git! 🤗

The Vision Transformer has +4 tokens (Register Tokens).
...And gated ReLU MLPs inside each layer + final Fusion MLP.
+20M parameters (~430M -> now: ~450M)
It's now a CLIP with an extremely low modality gap.
See the table below for details.
And if you want to know more about modality gaps & all details please check out the GitHub!

Attention Heatmap, pre-trained OpenAI CLIP ViT-L/14:

This model, CLIP REG-XGATED:

Text-To-Image examples, Flux.1-dev, pure CLIP (no T5) guidance:

Task / Dataset	Metric	ViT-L/14 OpenAI (Pre-trained)	X-GATED (ckpt20 xtreme)	X-GATED (ckpt12 balanced)	X-GATED (ckpt12 balanced, ablated)
VoC-2007 (Multilabel)	mAP	0.7615	0.8140	0.8471	0.8247
MSCOCO Retrieval	Image Recall@5	0.2194	0.3565	0.3532	0.3349
	Text Recall@5	0.3034	0.5425	0.5278	0.5086
Linear Probe CIFAR-10	Acc@1	0.9535	0.9813	0.9813	0.9811
	Acc@5	0.9966	0.9997	0.9997	0.9997
	Mean Class Recall	0.9535	0.9813	0.9813	0.9811
MVT ImageNet/ObjectNet (Zero-Shot)	Accuracy	0.8453	0.8686	0.8830	0.8815
Linear Probe ILSVRC2012	Top-1	69.86%	66.43%	67.10%	68.99%
	Top-5	92.70%	91.52%	91.83%	92.64%
Modality Gap Metrics	Euclidean Gap ↓	0.8276	0.4740	0.5395	0.7486
	JSD ↓	0.5200	0.1601	0.1303	0.3310
	Wasserstein Distance ↓	0.4084	0.1742	0.2102	0.3262
	Img-Text Cos Sim (mean) ↑	0.2723	0.4926	0.4794	0.3634
	Img-Text Cos Sim (std)	0.0362	0.0814	0.0758	0.0537
	Text-Text Cos Sim (mean)	0.6807	0.6657	0.6896	0.6896
	Text-Text Cos Sim (std)	0.1344	0.1671	0.1535	0.1535

Bolded values represent the best performance for each metric.