---
license: apache-2.0
datasets:
- laion/laion2B-en
- kakaobrain/coyo-700m
---
<div align="center">

<h2><a href="https://arxiv.org/abs/2402.04252">EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters</a></h2>

[Quan Sun](https://github.com/Quan-Sun)<sup>1*</sup>, [Jinsheng Wang](https://github.com/Wolfwjs/)<sup>1*</sup>, [Qiying Yu](https://yqy2001.github.io)<sup>1,2*</sup>, [Yufeng Cui](https://scholar.google.com/citations?hl=en&user=5Ydha2EAAAAJ)<sup>1</sup>, [Fan Zhang](https://scholar.google.com/citations?user=VsJ39HMAAAAJ)<sup>1</sup>, [Xiaosong Zhang](https://zhangxiaosong18.github.io)<sup>1</sup>, [Xinlong Wang](https://www.xloong.wang/)<sup>1</sup>

<sup>1</sup> [BAAI](https://www.baai.ac.cn/english.html), <sup>2</sup> [THU](https://air.tsinghua.edu.cn) <br><sup>*</sup> equal contribution

[Paper](https://arxiv.org/abs/2402.04252) | [Github](https://github.com/baaivision/EVA/tree/master/EVA-CLIP-18B)

</div>


Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18 billion parameters. With only 6 billion training samples seen, EVA-CLIP-18B achieves an exceptional **80.7%** zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5 billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe a consistent performance improvement as the EVA-CLIP model size scales, despite keeping the training dataset fixed at 2 billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models.


**Table of Contents**

- [Summary of EVA-CLIP performance](#summary-of-eva-clip-performance)
- [Model Card](#model-card)
  - [EVA-CLIP-8B](#eva-clip-8b)
  - [EVA-CLIP-18B](#eva-clip-18b)
- [Usage](#usage)
- [BibTeX \& Citation](#bibtex--citation)


## Summary of EVA-CLIP performance

![summary_tab](teaser.png)

Scaling behavior of EVA-CLIP: zero-shot classification performance averaged across 27 image classification benchmarks, compared with the current state-of-the-art and largest CLIP models (224px). The diameter of each circle corresponds to forward GFLOPs × the number of training samples seen. The performance of EVA-CLIP consistently improves as the model scales up.

## Model Card

### EVA-8B and EVA-18B
<div align="center">

| model name | total #params | seen samples | pytorch weight |
|:-----------|:------:|:------:|:------:|
| `EVA_8B_psz14` | 7.5B | 6B | [PT](https://huggingface.co/BAAI/EVA-CLIP-8B/resolve/main/EVA_8B_psz14.bin) (`31.0GB`) |
| `EVA_18B_psz14` | 17.5B | 6B | [PT](https://huggingface.co/BAAI/EVA-CLIP-18B/resolve/main/EVA_18B_psz14.bin) (`66.0GB`) |

</div>

### EVA-CLIP-8B

> Image encoder MIM teacher: [EVA02_CLIP_E_psz14_plus_s9B](https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA02_CLIP_E_psz14_s4B.pt).

<div align="center">

| model name | image enc. init. ckpt | text enc. init. ckpt | total #params | training data | training batch size | gpus for training | img. cls. avg. acc. | video cls. avg. acc. | retrieval MR | hf weight | pytorch weight |
|:-----|:-----|:-----------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| `EVA-CLIP-8B` | `EVA_8B_psz14` | `EVA02_CLIP_E_psz14_plus_s9B` | 8.1B | Merged-2B | 178K | 384 A100(40GB) | **79.4** | **73.6** | **86.2** | [🤗 HF](https://huggingface.co/BAAI/EVA-CLIP-8B) | [PT](https://huggingface.co/BAAI/EVA-CLIP-8B/resolve/main/EVA_CLIP_8B_psz14_s9B.pt) (`32.9GB`) |
| `EVA-CLIP-8B-448` | `EVA-CLIP-8B` | `EVA-CLIP-8B` | 8.1B | Merged-2B | 24K | 384 A100(40GB) | **80.0** | **73.7** | **86.4** | [🤗 HF](https://huggingface.co/BAAI/EVA-CLIP-8B-448) | [PT](https://huggingface.co/BAAI/EVA-CLIP-8B-448/resolve/main/EVA_CLIP_8B_psz14_plus_s0.6B.pt) (`32.9GB`) |

</div>

### EVA-CLIP-18B

> Image encoder MIM teacher: [EVA02_CLIP_E_psz14_plus_s9B](https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA02_CLIP_E_psz14_s4B.pt).

<div align="center">

| model name | image enc. init. ckpt | text enc. init. ckpt | total #params | training data | training batch size | gpus for training | img. cls. avg. acc. | video cls. avg. acc. | retrieval MR | hf weight | pytorch weight |
|:-----|:-----|:-----------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| `EVA-CLIP-18B` | `EVA_18B_psz14` | `EVA02_CLIP_E_psz14_plus_s9B` | 18.1B | Merged-2B+ | 108K | 360 A100(40GB) | **80.7** | **75.0** | **87.8** | [🤗 HF](https://huggingface.co/BAAI/EVA-CLIP-18B) | [PT](https://huggingface.co/BAAI/EVA-CLIP-18B/resolve/main/EVA_CLIP_18B_psz14_s6B.pt) (`69.0GB`) |

</div>


- To construct Merged-2B, we merged 1.6 billion samples from the [LAION-2B](https://laion.ai/blog/laion-5b/) dataset with 0.4 billion samples from [COYO-700M](https://github.com/kakaobrain/coyo-dataset) (see the illustrative sketch after this list).
- Merged-2B+ consists of all samples from Merged-2B, plus 20 million samples from [LAION-COCO](https://laion.ai/blog/laion-coco/) and 23 million samples from Merged-video, which includes [VideoCC](https://github.com/google-research-datasets/videoCC-data), [InternVid](https://huggingface.co/datasets/OpenGVLab/InternVid), and [WebVid-10M](https://maxbain.com/webvid-dataset/). Merged-video was added at the end of the training process.
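
The sketch below is purely illustrative and is not the actual EVA-CLIP data pipeline: it interleaves two stand-in sample streams at roughly the Merged-2B ratio of 1.6B LAION-2B to 0.4B COYO-700M samples (4:1). The function name, iterator names, and toy tuples are placeholders.

```python
# Illustrative only -- not the actual EVA-CLIP data pipeline.
# Interleave two sample streams at roughly the Merged-2B ratio (1.6B : 0.4B = 4:1).
import random

def merged_stream(laion_iter, coyo_iter, laion_prob=0.8, seed=0):
    """Yield (image, caption) samples, picking the LAION stream with probability laion_prob."""
    rng = random.Random(seed)
    while True:
        source = laion_iter if rng.random() < laion_prob else coyo_iter
        try:
            yield next(source)
        except StopIteration:  # stop once a selected stream is exhausted
            return

# Toy stand-in iterators; in practice these would stream image-text pairs from disk.
laion = iter([("laion_image", "laion_caption")] * 8)
coyo = iter([("coyo_image", "coyo_caption")] * 2)
for sample in merged_stream(laion, coyo):
    print(sample)
```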

**Note that all results presented in the paper were evaluated using the PyTorch weights; performance may differ when using the Hugging Face (hf) models.**

## Zero-Shot Evaluation

We use [CLIP-Benchmark](https://github.com/LAION-AI/CLIP_benchmark) to evaluate the zero-shot performance of EVA-CLIP models. Following [vissl](https://github.com/facebookresearch/vissl/blob/main/extra_scripts/datasets/create_k700_data_files.py), we evaluate zero-shot video classification using a single middle frame per video. Further details regarding the evaluation datasets can be found in our paper, particularly in Table 11.
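
As a concrete illustration of this single-middle-frame protocol, here is a minimal sketch (not the official evaluation code). It assumes `model`, `tokenizer`, and `processor` have been created exactly as in the Huggingface example below; the video path, class names, and prompt template are placeholders rather than the benchmarks' actual labels and prompts.

```python
# Sketch of single-middle-frame zero-shot video classification.
# Assumes `model`, `tokenizer`, and `processor` are built as in the Usage section below.
import torch
from PIL import Image
from torchvision.io import read_video

frames, _, _ = read_video("video.mp4", pts_unit="sec")        # (T, H, W, C) uint8 tensor
middle_frame = Image.fromarray(frames[len(frames) // 2].numpy())

class_names = ["playing guitar", "riding a bike", "swimming"]  # placeholder labels
prompts = [f"a video of {c}" for c in class_names]             # placeholder prompt template

input_ids = tokenizer(prompts, return_tensors="pt", padding=True).input_ids.to("cuda")
pixels = processor(images=middle_frame, return_tensors="pt").pixel_values.to("cuda")

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(pixels)
    text_features = model.encode_text(input_ids)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(dict(zip(class_names, probs.squeeze(0).tolist())))
```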

## Usage

### Huggingface Version
```python
from PIL import Image
from transformers import AutoModel, AutoConfig
from transformers import CLIPImageProcessor, pipeline, CLIPTokenizer
import torch
import torchvision.transforms as T
from torchvision.transforms import InterpolationMode

image_path = "CLIP.png"
model_name_or_path = "BAAI/EVA-CLIP-18B"  # or /path/to/local/EVA-CLIP-18B
image_size = 224

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# use the image processor with an explicit config
# processor = CLIPImageProcessor(size={"shortest_edge": image_size}, do_center_crop=True, crop_size=image_size)

## you can also build the image preprocessing directly with torchvision
## squash (resize to a square)
# processor = T.Compose(
#     [
#         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
#         T.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
#         T.ToTensor(),
#         T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
#     ]
# )
## shortest-edge resize + center crop
# processor = T.Compose(
#     [
#         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
#         T.Resize(image_size, interpolation=InterpolationMode.BICUBIC),
#         T.CenterCrop(image_size),
#         T.ToTensor(),
#         T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
#     ]
# )

model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=True).to('cuda').eval()

image = Image.open(image_path)
captions = ["a diagram", "a dog", "a cat"]
tokenizer = CLIPTokenizer.from_pretrained(model_name_or_path)
input_ids = tokenizer(captions, return_tensors="pt", padding=True).input_ids.to('cuda')
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(input_pixels)
    text_features = model.encode_text(input_ids)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

label_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probs: {label_probs}")
```
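
In float16 the 18B model occupies roughly 36 GB of weights, so it may not fit on a single GPU or load comfortably in limited CPU RAM. The options below are standard `transformers`/`accelerate` features rather than anything specific to this model card; whether `device_map="auto"` can shard this particular remote-code model across GPUs depends on its implementation, so treat this as a hedged sketch.

```python
# Hedged sketch: generic transformers options for loading a very large checkpoint.
# Whether they work here depends on this model's custom remote code.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "BAAI/EVA-CLIP-18B",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,   # avoid materializing a second full copy in CPU RAM
    device_map="auto",        # shard layers across available GPUs (requires `accelerate`)
    trust_remote_code=True).eval()
```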

### PyTorch version

Go to [GitHub](https://github.com/baaivision/EVA/tree/master/EVA-CLIP-18B)

```python
import torch
from eva_clip import create_model_and_transforms, get_tokenizer
from PIL import Image

model_name = "EVA-CLIP-18B"
pretrained = "eva_clip"  # or "/path/to/EVA_CLIP_18B_psz14_s6B.pt"

image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, processor = create_model_and_transforms(model_name, pretrained, force_custom_clip=True)
tokenizer = get_tokenizer(model_name)
model = model.to(device)

image = processor(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenizer(caption).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

If CPU memory is limited, you can leverage [deepspeed.zero.Init()](https://deepspeed.readthedocs.io/en/stable/zero3.html#constructing-massive-models) with DeepSpeed ZeRO stage 3. When loading a pretrained checkpoint in the context of deepspeed.zero.Init(), it is advised to use the `load_zero_partitions()` function in `eva_clip/factory.py`.
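
For reference, a minimal sketch of the deepspeed.zero.Init() pattern is shown below. It assumes a ZeRO stage-3 config at the hypothetical path `ds_config.json`, a standard distributed launch, and that the repo's factory can build the architecture without loading a checkpoint; checkpoint loading itself is left to `load_zero_partitions()` (not shown here, since its exact call is documented in the GitHub repo).

```python
# Sketch: construct the 18B model under deepspeed.zero.Init() so parameters are
# partitioned across ranks as they are created, instead of each process building
# the full model in CPU memory first.
# Assumptions: ZeRO stage-3 config at "ds_config.json" (hypothetical path), a
# standard deepspeed/torchrun launch, and building the architecture without a
# checkpoint (adjust the factory arguments to match the repo).
import deepspeed
from eva_clip import create_model_and_transforms

with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    model, _, processor = create_model_and_transforms(
        "EVA-CLIP-18B", force_custom_clip=True)

# Then load the pretrained weights into the partitioned model with
# load_zero_partitions() from eva_clip/factory.py (see the GitHub README for the
# exact usage); a plain torch.load() of the full checkpoint would defeat the
# memory savings.
```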
185
+
186
+ ## BibTeX & Citation
187
+
188
+ ```
189
+ @article{EVA-CLIP-18B,
190
+ title={EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters},
191
+ author={Quan Sun and Jinsheng Wang and Qiying Yu and Yufeng Cui and Fan Zhang and Xiaosong Zhang and Xinlong Wang},
192
+ journal={arXiv preprint arXiv:2402.04252},
193
+ year={2023}
194
+ }
195
+ ```