CLIP Needs Registers. And Gated MLPs. And +20M params.

Fixing CLIP's modality gap via happy little accidents.

  • ❤️ this CLIP? Donate if you can / want. TY! (and enjoy, either way!) 🤗


I just want a new Text Encoder...

  • ...for my Text-to-Image (Text-to-Video) AI! \o/
  • I recommend this one, the 'sweet spot' ckpt12: 👉 direct download 👈
  • Even lower modality gap (text 'more alike' to image, but less accurate): direct download
  • Enjoy! (You don't need to do anything else, they're just normal CLIP Text Encoders!)

⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️

  • The ViT (Vision Encoder) is basically a big mutant. So:
  • The full-model .safetensors use the 'import clip' (OpenAI) structure inside.
  • That's just so you don't need to load any 'danger pickles'. :)
  • For now, it runs with 'import clip'-style code (I'm working on a HF implementation, though!).
  • However, for now, I made an entire playground for the CLIP models (+ safetensors loading)! 🎉:
  • 🌟 https://github.com/zer0int/CLIP-fine-tune-registers-gated
  • All code for fine-tuning it yourself is also included on my Git! 🤗

Wait, but what is this?!

  • The Vision Transformer has +4 tokens (Register Tokens).
  • ...And gated ReLU MLPs inside each layer + final Fusion MLP.
  • +20M parameters (~430M -> now: ~450M)
  • It's now a CLIP with an extremely low modality gap.
  • See the table below for details.
  • And if you want to know more about modality gaps & all details please check out the GitHub!
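To make the two architectural changes above concrete, here is a hedged PyTorch sketch: 4 learned register tokens prepended to the patch tokens, and a gated MLP. The gating shown is one plausible reading (a ReGLU-style gate), not the repo's exact code; check the GitHub for the real implementation:

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """ReGLU-style gated MLP: a ReLU'd gate branch modulates a value branch."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.value = nn.Linear(dim, hidden)
        self.gate = nn.Linear(dim, hidden)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.proj(torch.relu(self.gate(x)) * self.value(x))

dim, batch = 8, 2
# 4 learned register tokens, broadcast over the batch and prepended
# to the patch tokens (much like the CLS token is).
registers = nn.Parameter(torch.zeros(1, 4, dim))
patches = torch.randn(batch, 16, dim)  # (batch, patch tokens, dim)
tokens = torch.cat([registers.expand(batch, -1, -1), patches], dim=1)

out = GatedMLP(dim, hidden=32)(tokens)
print(tuple(out.shape))  # (2, 20, 8) -- 16 patches + 4 registers
```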

An image is worth 16x16 words, alas:

Attention heatmap, pre-trained OpenAI CLIP ViT-L/14: [image]

This model, CLIP REG-XGATED: [image]

Text-To-Image examples, Flux.1-dev, pure CLIP (no T5) guidance:

[images]

Model Performance Overview

| Task / Dataset | Metric | ViT-L/14 OpenAI (pre-trained) | X-GATED (ckpt20, xtreme) | X-GATED (ckpt12, balanced) | X-GATED (ckpt12, balanced, ablated) |
|---|---|---|---|---|---|
| VOC-2007 (multilabel) | mAP | 0.7615 | 0.8140 | **0.8471** | 0.8247 |
| MSCOCO Retrieval | Image Recall@5 | 0.2194 | **0.3565** | 0.3532 | 0.3349 |
| | Text Recall@5 | 0.3034 | **0.5425** | 0.5278 | 0.5086 |
| Linear Probe CIFAR-10 | Acc@1 | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| | Acc@5 | 0.9966 | **0.9997** | **0.9997** | **0.9997** |
| | Mean Class Recall | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| MVT ImageNet/ObjectNet (zero-shot) | Accuracy | 0.8453 | 0.8686 | **0.8830** | 0.8815 |
| Linear Probe ILSVRC2012 | Top-1 | **69.86%** | 66.43% | 67.10% | 68.99% |
| | Top-5 | **92.70%** | 91.52% | 91.83% | 92.64% |
| Modality Gap Metrics | Euclidean Gap ↓ | 0.8276 | **0.4740** | 0.5395 | 0.7486 |
| | JSD ↓ | 0.5200 | 0.1601 | **0.1303** | 0.3310 |
| | Wasserstein Distance ↓ | 0.4084 | **0.1742** | 0.2102 | 0.3262 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | **0.4926** | 0.4794 | 0.3634 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0814 | 0.0758 | 0.0537 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.6657 | 0.6896 | 0.6896 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1671 | 0.1535 | 0.1535 |

Bolded values represent the best performance for each metric.
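For intuition on the modality-gap rows: the "Euclidean Gap" is commonly measured as the distance between the mean (centroid) image and text embeddings on the unit sphere, and the cos-sim rows as pairwise cosine similarity of matched pairs. A minimal NumPy sketch under that assumption (the repo's evaluation code may define these slightly differently), using random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for L2-normalized image/text embeddings (real ones would come
# from the model); the text batch is shifted to create an artificial gap.
img = rng.normal(size=(100, 768))
txt = rng.normal(size=(100, 768)) + 0.5
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

# Distance between the two modality centroids on the unit sphere.
euclidean_gap = float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Cosine similarity of matched image/text pairs (rows are already unit-norm).
pair_cos = (img * txt).sum(axis=1)

print(f"gap={euclidean_gap:.4f}  cos mean={pair_cos.mean():.4f}  std={pair_cos.std():.4f}")
```

A lower gap and higher mean pairwise cosine similarity mean the two embedding clouds overlap more, which is exactly what the X-GATED columns show versus pre-trained OpenAI CLIP.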

