arxiv:2303.15389

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Published on Mar 27, 2023

Authors:

Quan Sun

Abstract

Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
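The zero-shot top-1 accuracy quoted in the abstract is the usual CLIP-style metric: each image embedding is compared against one text embedding per class, and the prediction is the class with the highest cosine similarity. The following is a minimal sketch of that metric on toy data, not the EVA-CLIP evaluation code. The function names and the 2-dimensional embeddings are hypothetical and chosen only for illustration; in practice the embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_scores(image_emb, class_embs):
    """Cosine similarity between one image embedding and every class text embedding."""
    image = l2_normalize(image_emb)
    return [
        sum(a * b for a, b in zip(image, l2_normalize(class_emb)))
        for class_emb in class_embs
    ]


def zero_shot_top1_accuracy(image_embs, labels, class_embs):
    """Percentage of images whose most similar class embedding matches the label."""
    correct = 0
    for image_emb, label in zip(image_embs, labels):
        scores = cosine_scores(image_emb, class_embs)
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Toy data (hypothetical): two classes, four images, 2-dimensional embeddings.
class_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
labels = [0, 1, 1, 0]

accuracy = zero_shot_top1_accuracy(image_embs, labels, class_embs)
print(accuracy)  # 75.0: the third image is closer to class 0 than to its label, class 1
```

No classifier is trained on the target classes here; the class text embeddings alone act as the classifier, which is what makes the evaluation zero-shot.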

