In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using images only (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., 3-7% ↑), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.
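
The sketch below illustrates this idea as a minimal, hypothetical PyTorch training step, not the released DIVA implementation: it assumes a frozen text-to-image diffusion model (VAE, U-Net, noise scheduler) whose condition embedding is replaced by projected CLIP visual features, and it updates only the CLIP visual encoder by minimizing the standard denoising objective on unlabeled images.

```python
# Minimal sketch of diffusion-feedback tuning for a CLIP visual encoder.
# All module names and interfaces (clip_visual, proj, vae, unet, scheduler)
# are hypothetical stand-ins; the released DIVA code may differ in detail.
import torch
import torch.nn.functional as F

def diva_step(images, clip_visual, proj, vae, unet, scheduler, optimizer):
    """One self-supervised update: only CLIP's visual encoder is trained."""
    # 1) Dense visual features from CLIP act as the generative condition
    #    (no text is used).
    cond = proj(clip_visual(images))              # [B, N, D_cond]

    # 2) Encode images into the diffusion model's latent space (frozen VAE).
    with torch.no_grad():
        latents = vae.encode(images)              # [B, C, h, w]

    # 3) Standard denoising objective: add noise at a random timestep and ask
    #    the frozen U-Net to predict it, conditioned on the CLIP features.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_steps, (latents.size(0),),
                      device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, cond)

    # 4) The reconstruction loss back-propagates through `cond` into the CLIP
    #    visual encoder; the diffusion model itself stays frozen.
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```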

Model Zoo

| Method | Image Size | Params (M) | Average Score |
|---|---|---|---|
| OpenAI ViT-L-14 | 224² | 427.6 | 25.9 (+6.6) |
| OpenAI ViT-L-14 | 336² | 427.9 | 25.2 (+5.2) |
| MetaCLIP ViT-L-14 | 224² | 427.6 | 27.4 (+3.7) |
| MetaCLIP ViT-H-14 | 224² | 986.1 | 31.9 (+6.7) |
| SigLIP ViT-SO-14 | 224² | 877.4 | 40.7 (+2.9) |
| SigLIP ViT-SO-14 | 384² | 878.0 | 38.5 (+1.5) |
| DFN ViT-H-14 | 224² | 986.1 | 43.7 (+4.4) |
| DFN ViT-H-14 | 378² | 986.7 | 37.8 (+3.0) |
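
Since the DIVA-tuned checkpoints keep their original CLIP architectures, one plausible way to use them is to instantiate the matching backbone with open_clip and then load the fine-tuned weights on top. The snippet below is a hypothetical example only: the checkpoint filename and state-dict layout are assumptions, not documented here, so adapt them to the actual files in this repository.

```python
# Hypothetical loading sketch: build the matching CLIP backbone with open_clip,
# then overwrite its weights with a DIVA-tuned checkpoint.
import torch
import open_clip
from PIL import Image

# "ViT-L-14" with "openai" weights corresponds to the first row of the table above.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Overwrite with DIVA-tuned weights (filename and state-dict format are assumptions).
state_dict = torch.load("DIVA_OpenAI_ViT-L-14_224.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates key-name drift
model.eval()

# Standard CLIP zero-shot scoring.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```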

πŸ“ Citation

If you find DIVA helpful for your research, please consider citing 📝 our paper and giving us a GitHub star ⭐:

@article{wang2024diffusion,
      title={Diffusion Feedback Helps CLIP See Better},
      author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
      journal={arXiv preprint arXiv:2407.20171},
      year={2024}
}