kunyi committed on
Commit
f19719d
1 Parent(s): 42bec8a

Upload 6 files

Files changed (6)
  1. README.md +283 -0
  2. README_CN.md +280 -0
  3. config.json +114 -0
  4. preprocessor_config.json +21 -0
  5. pytorch_model.bin +3 -0
  6. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,286 @@
1
  ---
2
  license: apache-2.0
3
+ pipeline_tag: zero-shot-classification
4
+ widget:
5
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
6
+ candidate_labels: 演奏, 运动
7
+ example_title: 猫和狗
8
  ---
9
+ [**中文说明**](README_CN.md) | [**English**](README.md)
10
+ # Introduction
11
+ This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs and their associated Chinese text descriptions, totaling 400 million pairs. After screening, we ultimately used 100 million pairs for training.
12
+ This project was developed by the QQ-ARC Joint Lab, Tencent PCG.
13
+ <br><br>
14
+
15
+ # Models and Results
16
+ <span id="model_card"></span>
17
+ ## Model Card
18
+ QA-CLIP currently provides three open-source models of different sizes; their details and download links are listed in the table below:
19
+ <table border="1" width="100%">
20
+ <tr align="center">
21
+ <th>Model</th><th>Checkpoint</th><th>Params</th><th>Vision Backbone</th><th>Vision Params</th><th>Text Backbone</th><th>Text Params</th><th>Resolution</th>
22
+ </tr>
23
+ <tr align="center">
24
+ <td>QA-CLIP<sub>RN50</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-RN50.pt">Download</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
25
+ </tr>
26
+ <tr align="center">
27
+ <td>QA-CLIP<sub>ViT-B/16</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-base.pt">Download</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
28
+ </tr>
29
+ <tr align="center">
30
+ <td>QA-CLIP<sub>ViT-L/14</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-large.pt">Download</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
31
+ </tr>
32
+ </table>
33
+ <br>
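+ The checkpoints linked above can also be fetched programmatically. A minimal sketch using `huggingface_hub`, with the repo id and file name taken from the table links above:
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Download the ViT-B/16 checkpoint listed in the table above.
+ ckpt_path = hf_hub_download(repo_id="TencentARC/QA-CLIP", filename="QA-CLIP-base.pt")
+ print(ckpt_path)
+ ```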
34
+
35
+ ## Results
36
+ We conducted zero-shot tests on the [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn) datasets for image-text retrieval. For zero-shot image classification, we tested on the ImageNet dataset. The results are shown in the tables below:
37
+
38
+ **Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
39
+ <table border="1" width="120%">
40
+ <tr align="center">
41
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
42
+ </tr>
43
+ <tr align="center">
44
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
45
+ </tr>
46
+ <tr align="center">
47
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
48
+ </tr>
49
+ <tr align="center">
50
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
51
+ </tr>
52
+ <tr align="center">
53
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
54
+ </tr>
55
+ <tr align="center">
56
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
57
+ </tr>
58
+ <tr align="center">
59
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
60
+ </tr>
61
+ <tr align="center">
62
+ <td width="120%">AltClip<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td>94.8</td><td>84.8</td><td>97.7</td><td>99.1</td>
63
+ </tr>
64
+ <tr align="center">
65
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td><b>94.7</b></td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
66
+ </tr>
67
+ </table>
68
+ <br>
69
+
70
+ **MUGE Zero-shot Retrieval (Official Validation Set)**:
71
+ <table border="1" width="120%">
72
+ <tr align="center">
73
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
74
+ </tr>
75
+ <tr align="center">
76
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
77
+ </tr>
78
+ <tr align="center">
79
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
80
+ </tr>
81
+ <tr align="center">
82
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
83
+ </tr>
84
+ <tr align="center">
85
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
86
+ </tr>
87
+ <tr align="center">
88
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
89
+ </tr>
90
+ <tr align="center">
91
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
92
+ </tr>
93
+ <tr align="center">
94
+ <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
95
+ </tr>
96
+ <tr align="center">
97
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
98
+ </tr>
99
+ </table>
100
+ <br>
101
+
102
+ **COCO-CN Zero-shot Retrieval (Official Test Set)**:
103
+ <table border="1" width="120%">
104
+ <tr align="center">
105
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
106
+ </tr>
107
+ <tr align="center">
108
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
109
+ </tr>
110
+ <tr align="center">
111
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
112
+ </tr>
113
+ <tr align="center">
114
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
115
+ </tr>
116
+ <tr align="center">
117
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
118
+ </tr>
119
+ <tr align="center">
120
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
121
+ </tr>
122
+ <tr align="center">
123
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
124
+ </tr>
125
+ <tr align="center">
126
+ <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
127
+ </tr>
128
+ <tr align="center">
129
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
130
+ </tr>
131
+ </table>
132
+ <br>
133
+
134
+ **Zero-shot Image Classification on ImageNet**:
135
+ <table border="1" width="120%">
136
+ <tr align="center">
137
+ <th>Task</th><th colspan="1">ImageNet</th>
138
+ </tr>
139
+ <tr align="center">
140
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
141
+ </tr>
142
+ <tr align="center">
143
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
144
+ </tr>
145
+ <tr align="center">
146
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
147
+ </tr>
148
+ <tr align="center">
149
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
150
+ </tr>
151
+ <tr align="center">
152
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
153
+ </tr>
154
+ <tr align="center">
155
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
156
+ </tr>
157
+ </table>
158
+ <br>
159
+
160
+ <br><br>
161
+
162
+
163
+ # Getting Started
164
+ ## Installation Requirements
165
+ Environment configuration requirements:
166
+
167
+ * python >= 3.6.4
168
+ * pytorch >= 1.8.0 (with torchvision >= 0.9.0)
169
+ * CUDA Version >= 10.2
170
+
171
+ Install required packages:
172
+ ```bash
173
+ cd /yourpath/QA-CLIP-main
174
+ pip install -r requirements.txt
175
+ ```
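+ A quick way to confirm that the environment satisfies the requirements above (a small sketch, not part of the original project):
+ ```python
+ import torch
+ import torchvision
+
+ # Print the installed versions and whether a CUDA device is visible.
+ print("torch:", torch.__version__)
+ print("torchvision:", torchvision.__version__)
+ print("CUDA available:", torch.cuda.is_available(), "| CUDA version:", torch.version.cuda)
+ ```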
176
+
177
+ ## Inference Code
178
+ ```bash
179
+ export PYTHONPATH=/yourpath/QA-CLIP-main
180
+ ```
181
+ Inference code example:
182
+ ```python
183
+ import torch
184
+ from PIL import Image
185
+
186
+ import clip as clip
187
+ from clip import load_from_name, available_models
188
+ print("Available models:", available_models())
189
+ # Available models: ['ViT-B-16', 'ViT-L-14', 'RN50']
190
+
191
+ device = "cuda" if torch.cuda.is_available() else "cpu"
192
+ model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
193
+ model.eval()
194
+ image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
195
+ text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
196
+
197
+ with torch.no_grad():
198
+     image_features = model.encode_image(image)
199
+     text_features = model.encode_text(text)
200
+     # Normalize the features. Please use the normalized features for downstream tasks.
201
+     image_features /= image_features.norm(dim=-1, keepdim=True)
202
+     text_features /= text_features.norm(dim=-1, keepdim=True)
203
+
204
+     logits_per_image, logits_per_text = model.get_similarity(image, text)
205
+     probs = logits_per_image.softmax(dim=-1).cpu().numpy()
206
+
207
+ print("Label probs:", probs)
208
+ ```
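+ Because this repository also ships a Transformers-style `config.json` (declaring `ChineseCLIPModel`) and a `preprocessor_config.json`, the weights can presumably be loaded through Hugging Face Transformers as well. A minimal sketch; the repo id below is a placeholder, so substitute the id of this model repository:
+ ```python
+ import torch
+ import requests
+ from PIL import Image
+ from transformers import ChineseCLIPModel, ChineseCLIPProcessor
+
+ repo_id = "TencentARC/QA-CLIP-ViT-B-16"  # placeholder: use this repo's actual id
+ model = ChineseCLIPModel.from_pretrained(repo_id).eval()
+ processor = ChineseCLIPProcessor.from_pretrained(repo_id)
+
+ # Image and candidate labels taken from the widget example in the model card header.
+ url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png"
+ image = Image.open(requests.get(url, stream=True).raw)
+ texts = ["演奏", "运动"]
+
+ inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+ with torch.no_grad():
+     probs = model(**inputs).logits_per_image.softmax(dim=-1)
+ print(probs)
+ ```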
209
+ <br><br>
210
+
211
+ ## Prediction and Evaluation
212
+
213
+ ### Download Image-text Retrieval Test Dataset
214
+ In the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project, the test sets have already been preprocessed. Here are the download links they provide:
215
+
216
+ MUGE dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)
217
+
218
+ Flickr30K-CN dataset: [download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)
219
+
220
+ Additionally, obtaining the [COCO-CN](https://github.com/li-xirong/coco-cn) dataset requires applying to the original author.
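+ A minimal sketch for fetching and unpacking the two preprocessed archives above (the `DATAPATH` layout matches the evaluation script below; COCO-CN is excluded because it must be requested from its authors):
+ ```python
+ import os
+ import urllib.request
+ import zipfile
+
+ DATAPATH = "your_DATAPATH"  # same placeholder as in the evaluation script
+ os.makedirs(f"{DATAPATH}/datasets", exist_ok=True)
+
+ archives = {
+     "MUGE": "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip",
+     "Flickr30k-CN": "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip",
+ }
+ for name, url in archives.items():
+     archive_path = f"{DATAPATH}/datasets/{name}.zip"
+     urllib.request.urlretrieve(url, archive_path)   # download the archive
+     with zipfile.ZipFile(archive_path) as zf:
+         zf.extractall(f"{DATAPATH}/datasets")       # unpack next to the zip
+ ```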
221
+
222
+ ### Download ImageNet Dataset
223
+ Please download the raw data yourself. The [Chinese labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label_cn.txt) and [English labels](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label.txt) are provided by the <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b> project.
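+ The two label files can be fetched directly from the links above; a small sketch:
+ ```python
+ import urllib.request
+
+ # Chinese and English ImageNet label files provided by the Chinese-CLIP project.
+ base = "http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K"
+ for fname in ("label_cn.txt", "label.txt"):
+     urllib.request.urlretrieve(f"{base}/{fname}", fname)
+ ```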
224
+ ### Image-text Retrieval Evaluation
225
+ The image-text retrieval evaluation can be run as follows:
226
+ ```bash
227
+ split=test # compute features for the valid or test split
228
+ resume=your_ckp_path
229
+ DATAPATH=your_DATAPATH
230
+ dataset_name=Flickr30k-CN
231
+ # dataset_name=MUGE
232
+
233
+ python -u eval/extract_features.py \
234
+ --extract-image-feats \
235
+ --extract-text-feats \
236
+ --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
237
+ --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
238
+ --img-batch-size=32 \
239
+ --text-batch-size=32 \
240
+ --context-length=52 \
241
+ --resume=${resume} \
242
+ --vision-model=ViT-B-16 \
243
+ --text-model=RoBERTa-wwm-ext-base-chinese
244
+
245
+ python -u eval/make_topk_predictions.py \
246
+ --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
247
+ --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
248
+ --top-k=10 \
249
+ --eval-batch-size=32768 \
250
+ --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
251
+
252
+ python -u eval/make_topk_predictions_tr.py \
253
+ --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
254
+ --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
255
+ --top-k=10 \
256
+ --eval-batch-size=32768 \
257
+ --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
258
+
259
+ python eval/evaluation.py \
260
+ ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
261
+ ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
262
+ ${DATAPATH}/datasets/${dataset_name}/output1.json
263
+ cat ${DATAPATH}/datasets/${dataset_name}/output1.json
264
+
265
+ python eval/transform_ir_annotation_to_tr.py \
266
+ --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
267
+
268
+ python eval/evaluation_tr.py \
269
+ ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
270
+ ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
271
+ ${DATAPATH}/datasets/${dataset_name}/output2.json
272
+ cat ${DATAPATH}/datasets/${dataset_name}/output2.json
273
+ ```
274
+
275
+ ### ImageNet Zero-shot Classification
276
+ ImageNet zero-shot classification can be run as follows:
277
+ ```bash
278
+ bash scripts/zeroshot_eval.sh 0 \
279
+ ${DATAPATH} imagenet \
280
+ ViT-B-16 RoBERTa-wwm-ext-base-chinese \
281
+ ./pretrained_weights/QA-CLIP-base.pt
282
+ ```
283
+ <br><br>
284
+ # Acknowledgments
285
+ The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, and we are very grateful for their outstanding open-source contribution.
286
+ <br><br>
README_CN.md ADDED
@@ -0,0 +1,280 @@
1
+ [**中文说明**](README_CN.md) | [**English**](README.md)
2
+ # 项目介绍
3
+ 本项目旨在提供更好的中文CLIP模型。该项目使用的训练数据均为公开可访问的图像URL及相关中文文本描述,总量达到400M。经过筛选后,我们最终使用了100M的数据进行训练。
4
+ 本项目由QQ-ARC Joint Lab, Tencent PCG完成。
5
+ <br><br>
6
+
7
+ # 模型及实验
8
+ <span id="model_card"></span>
9
+ ## 模型规模 & 下载链接
10
+ QA-CLIP目前开源3个不同规模,其模型信息和下载方式见下表:
11
+
12
+ <table border="1" width="100%">
13
+ <tr align="center">
14
+ <th>模型规模</th><th>下载链接</th><th>参数量</th><th>视觉侧骨架</th><th>视觉侧参数量</th><th>文本侧骨架</th><th>文本侧参数量</th><th>分辨率</th>
15
+ </tr>
16
+ <tr align="center">
17
+ <td>QA-CLIP<sub>RN50</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-RN50.pt">Download</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
18
+ </tr>
19
+ <tr align="center">
20
+ <td>QA-CLIP<sub>ViT-B/16</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-base.pt">Download</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
21
+ </tr>
22
+ <tr align="center">
23
+ <td>QA-CLIP<sub>ViT-L/14</sub></td><td><a href="https://huggingface.co/TencentARC/QA-CLIP/resolve/main/QA-CLIP-large.pt">Download</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
24
+ </tr>
25
+ </table>
26
+ <br>
27
+
28
+ ## 实验结果
29
+ 针对图文检索任务,我们在[MUGE Retrieval](https://tianchi.aliyun.com/muge)、[Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap)和[COCO-CN](https://github.com/li-xirong/coco-cn)上进行了zero-shot测试。
30
+ 针对图像零样本分类任务,我们在ImageNet数据集上进行了测试。测试结果见下表:
31
+
32
+
33
+ **Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
34
+ <table border="1" width="120%">
35
+ <tr align="center">
36
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
37
+ </tr>
38
+ <tr align="center">
39
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
40
+ </tr>
41
+ <tr align="center">
42
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
43
+ </tr>
44
+ <tr align="center">
45
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
46
+ </tr>
47
+ <tr align="center">
48
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
49
+ </tr>
50
+ <tr align="center">
51
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
52
+ </tr>
53
+ <tr align="center">
54
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
55
+ </tr>
56
+ <tr align="center">
57
+ <td width="120%">AltClip<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td>94.8</td><td>84.8</td><td>97.7</td><td>99.1</td>
58
+ </tr>
59
+ <tr align="center">
60
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td><b>94.7</b></td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
61
+ </tr>
62
+ </table>
63
+ <br>
64
+
65
+ **MUGE Zero-shot Retrieval (Official Validation Set)**:
66
+ <table border="1" width="120%">
67
+ <tr align="center">
68
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
69
+ </tr>
70
+ <tr align="center">
71
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
72
+ </tr>
73
+ <tr align="center">
74
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
75
+ </tr>
76
+ <tr align="center">
77
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
78
+ </tr>
79
+ <tr align="center">
80
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
81
+ </tr>
82
+ <tr align="center">
83
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
84
+ </tr>
85
+ <tr align="center">
86
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
87
+ </tr>
88
+ <tr align="center">
89
+ <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
90
+ </tr>
91
+ <tr align="center">
92
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
93
+ </tr>
94
+ </table>
95
+ <br>
96
+
97
+ **COCO-CN Zero-shot Retrieval (Official Test Set)**:
98
+ <table border="1" width="120%">
99
+ <tr align="center">
100
+ <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
101
+ </tr>
102
+ <tr align="center">
103
+ <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
104
+ </tr>
105
+ <tr align="center">
106
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
107
+ </tr>
108
+ <tr align="center">
109
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
110
+ </tr>
111
+ <tr align="center">
112
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
113
+ </tr>
114
+ <tr align="center">
115
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
116
+ </tr>
117
+ <tr align="center">
118
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
119
+ </tr>
120
+ <tr align="center">
121
+ <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
122
+ </tr>
123
+ <tr align="center">
124
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
125
+ </tr>
126
+ </table>
127
+ <br>
128
+
129
+ **Zero-shot Image Classification on ImageNet**:
130
+ <table border="1" width="120%">
131
+ <tr align="center">
132
+ <th>Task</th><th colspan="1">ImageNet</th>
133
+ </tr>
134
+ <tr align="center">
135
+ <td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
136
+ </tr>
137
+ <tr align="center">
138
+ <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
139
+ </tr>
140
+ <tr align="center">
141
+ <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
142
+ </tr>
143
+ <tr align="center">
144
+ <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
145
+ </tr>
146
+ <tr align="center">
147
+ <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
148
+ </tr>
149
+ <tr align="center">
150
+ <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
151
+ </tr>
152
+ </table>
153
+ <br>
154
+
155
+ <br><br>
156
+
157
+
158
+ # 使用教程
159
+ ## 安装要求
160
+ 环境配置要求:
161
+
162
+ * python >= 3.6.4
163
+ * pytorch >= 1.8.0 (with torchvision >= 0.9.0)
164
+ * CUDA Version >= 10.2
165
+
166
+ 安装本项目所需库
167
+ ```bash
168
+ cd /yourpath/QA-CLIP-main
169
+ pip install -r requirements.txt
170
+ ```
171
+
172
+ ## 推理代码
173
+ ```bash
174
+ export PYTHONPATH=/yourpath/QA-CLIP-main
175
+ ```
176
+ 推理代码示例:
177
+ ```python
178
+ import torch
179
+ from PIL import Image
180
+
181
+ import clip as clip
182
+ from clip import load_from_name, available_models
183
+ print("Available models:", available_models())
184
+ # Available models: ['ViT-B-16', 'ViT-L-14', 'RN50']
185
+
186
+ device = "cuda" if torch.cuda.is_available() else "cpu"
187
+ model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
188
+ model.eval()
189
+ image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
190
+ text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
191
+
192
+ with torch.no_grad():
193
+     image_features = model.encode_image(image)
194
+     text_features = model.encode_text(text)
195
+     # 对特征进行归一化,请使用归一化后的图文特征用于下游任务
196
+     image_features /= image_features.norm(dim=-1, keepdim=True)
197
+     text_features /= text_features.norm(dim=-1, keepdim=True)
198
+
199
+     logits_per_image, logits_per_text = model.get_similarity(image, text)
200
+     probs = logits_per_image.softmax(dim=-1).cpu().numpy()
201
+
202
+ print("Label probs:", probs)
203
+ ```
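+ 由于本仓库同时提供了 Transformers 格式的 `config.json`(声明 `ChineseCLIPModel`)和 `preprocessor_config.json`,权重应当也可以通过 Hugging Face Transformers 加载。以下为示意代码(其中仓库名仅为占位符,请替换为本模型仓库的实际 id):
+ ```python
+ import torch
+ import requests
+ from PIL import Image
+ from transformers import ChineseCLIPModel, ChineseCLIPProcessor
+
+ repo_id = "TencentARC/QA-CLIP-ViT-B-16"  # 占位符:请替换为本仓库的实际 id
+ model = ChineseCLIPModel.from_pretrained(repo_id).eval()
+ processor = ChineseCLIPProcessor.from_pretrained(repo_id)
+
+ # 图片与候选标签取自模型卡头部的 widget 示例
+ url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png"
+ image = Image.open(requests.get(url, stream=True).raw)
+ texts = ["演奏", "运动"]
+
+ inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+ with torch.no_grad():
+     probs = model(**inputs).logits_per_image.softmax(dim=-1)
+ print(probs)
+ ```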
204
+ <br><br>
205
+
206
+ ## 预测及评估
207
+
208
+ ### 图文检索测试数据集下载
209
+ <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>项目中已经预处理好测试集,这是他们提供的下载链接:
210
+
211
+ MUGE数据:[下载链接](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)
212
+
213
+ Flickr30K-CN数据:[下载链接](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)
214
+
215
+ 另外[COCO-CN](https://github.com/li-xirong/coco-cn)数据的获取需要向原作者进行申请
216
+ ### ImageNet数据集下载
217
+ 原始数据请自行下载,[中文标签](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label_cn.txt)和[英文标签](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/ImageNet-1K/label.txt)同样由<b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>项目提供
218
+ ### 图文检索评估
219
+ 图文检索评估代码可以参考如下:
220
+ ```bash
221
+ split=test # 指定计算valid或test集特征
222
+ resume=your_ckp_path
223
+ DATAPATH=your_DATAPATH
224
+ dataset_name=Flickr30k-CN
225
+ # dataset_name=MUGE
226
+
227
+ python -u eval/extract_features.py \
228
+ --extract-image-feats \
229
+ --extract-text-feats \
230
+ --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
231
+ --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
232
+ --img-batch-size=32 \
233
+ --text-batch-size=32 \
234
+ --context-length=52 \
235
+ --resume=${resume} \
236
+ --vision-model=ViT-B-16 \
237
+ --text-model=RoBERTa-wwm-ext-base-chinese
238
+
239
+ python -u eval/make_topk_predictions.py \
240
+ --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
241
+ --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
242
+ --top-k=10 \
243
+ --eval-batch-size=32768 \
244
+ --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
245
+
246
+ python -u eval/make_topk_predictions_tr.py \
247
+ --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
248
+ --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
249
+ --top-k=10 \
250
+ --eval-batch-size=32768 \
251
+ --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
252
+
253
+ python eval/evaluation.py \
254
+ ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
255
+ ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
256
+ ${DATAPATH}/datasets/${dataset_name}/output1.json
257
+ cat ${DATAPATH}/datasets/${dataset_name}/output1.json
258
+
259
+ python eval/transform_ir_annotation_to_tr.py \
260
+ --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
261
+
262
+ python eval/evaluation_tr.py \
263
+ ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
264
+ ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
265
+ ${DATAPATH}/datasets/${dataset_name}/output2.json
266
+ cat ${DATAPATH}/datasets/${dataset_name}/output2.json
267
+ ```
268
+
269
+ ### ImageNet零样本分类
270
+ ImageNet零样本分类的代码参考如下
271
+ ```bash
272
+ bash scripts/zeroshot_eval.sh 0 \
273
+ ${DATAPATH} imagenet \
274
+ ViT-B-16 RoBERTa-wwm-ext-base-chinese \
275
+ ./pretrained_weights/QA-CLIP-base.pt
276
+ ```
277
+ <br><br>
278
+ # 致谢
279
+ 项目代码基于<b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>实现,非常感谢他们优秀的开源工作。
280
+ <br><br>
config.json ADDED
@@ -0,0 +1,114 @@
1
+ {
2
+ "architectures": [
3
+ "ChineseCLIPModel"
4
+ ],
5
+ "initializer_factor": 1.0,
6
+ "logit_scale_init_value": 2.6592,
7
+ "model_type": "chinese_clip",
8
+ "projection_dim": 512,
9
+ "text_config": {
10
+ "architectures": [
11
+ "ChineseCLIPTextModel"
12
+ ],
13
+ "attention_probs_dropout_prob": 0.1,
14
+ "bos_token_id": 0,
15
+ "directionality": "bidi",
16
+ "eos_token_id": 2,
17
+ "hidden_act": "gelu",
18
+ "hidden_dropout_prob": 0.1,
19
+ "hidden_size": 768,
20
+ "initializer_range": 0.02,
21
+ "intermediate_size": 3072,
22
+ "layer_norm_eps": 1e-12,
23
+ "max_position_embeddings": 512,
24
+ "model_type": "chinese_clip_text_model",
25
+ "num_attention_heads": 12,
26
+ "num_hidden_layers": 12,
27
+ "output_past": true,
28
+ "pad_token_id": 0,
29
+ "pooler_fc_size": 768,
30
+ "pooler_num_attention_heads": 12,
31
+ "pooler_num_fc_layers": 3,
32
+ "pooler_size_per_head": 128,
33
+ "pooler_type": "first_token_transform",
34
+ "type_vocab_size": 2,
35
+ "vocab_size": 21128
36
+ },
37
+ "text_config_dict": null,
38
+ "torch_dtype": "float32",
39
+ "transformers_version": null,
40
+ "vision_config": {
41
+ "_name_or_path": "",
42
+ "add_cross_attention": false,
43
+ "architectures": null,
44
+ "attention_dropout": 0.0,
45
+ "bad_words_ids": null,
46
+ "bos_token_id": null,
47
+ "chunk_size_feed_forward": 0,
48
+ "decoder_start_token_id": null,
49
+ "diversity_penalty": 0.0,
50
+ "do_sample": false,
51
+ "dropout": 0.0,
52
+ "early_stopping": false,
53
+ "encoder_no_repeat_ngram_size": 0,
54
+ "eos_token_id": null,
55
+ "finetuning_task": null,
56
+ "forced_bos_token_id": null,
57
+ "forced_eos_token_id": null,
58
+ "hidden_act": "quick_gelu",
59
+ "hidden_size": 768,
60
+ "id2label": {
61
+ "0": "LABEL_0",
62
+ "1": "LABEL_1"
63
+ },
64
+ "image_size": 224,
65
+ "initializer_factor": 1.0,
66
+ "initializer_range": 0.02,
67
+ "intermediate_size": 3072,
68
+ "is_decoder": false,
69
+ "is_encoder_decoder": false,
70
+ "label2id": {
71
+ "LABEL_0": 0,
72
+ "LABEL_1": 1
73
+ },
74
+ "layer_norm_eps": 1e-05,
75
+ "length_penalty": 1.0,
76
+ "max_length": 20,
77
+ "min_length": 0,
78
+ "model_type": "chinese_clip_vision_model",
79
+ "no_repeat_ngram_size": 0,
80
+ "num_attention_heads": 12,
81
+ "num_beam_groups": 1,
82
+ "num_beams": 1,
83
+ "num_hidden_layers": 12,
84
+ "num_return_sequences": 1,
85
+ "output_attentions": false,
86
+ "output_hidden_states": false,
87
+ "output_scores": false,
88
+ "pad_token_id": null,
89
+ "patch_size": 16,
90
+ "prefix": null,
91
+ "problem_type": null,
92
+ "projection_dim" : 512,
93
+ "pruned_heads": {},
94
+ "remove_invalid_values": false,
95
+ "repetition_penalty": 1.0,
96
+ "return_dict": true,
97
+ "return_dict_in_generate": false,
98
+ "sep_token_id": null,
99
+ "task_specific_params": null,
100
+ "temperature": 1.0,
101
+ "tie_encoder_decoder": false,
102
+ "tie_word_embeddings": true,
103
+ "tokenizer_class": null,
104
+ "top_k": 50,
105
+ "top_p": 1.0,
106
+ "torch_dtype": null,
107
+ "torchscript": false,
108
+ "transformers_version": "4.12.0.dev0",
109
+ "use_bfloat16": false
110
+ },
111
+ "vision_config_dict": {
112
+ "patch_size": 16
113
+ }
114
+ }
preprocessor_config.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "do_center_crop": false,
3
+ "do_normalize": true,
4
+ "do_resize": true,
5
+ "feature_extractor_type": "ChineseCLIPFeatureExtractor",
6
+ "image_mean": [
7
+ 0.48145466,
8
+ 0.4578275,
9
+ 0.40821073
10
+ ],
11
+ "image_std": [
12
+ 0.26862954,
13
+ 0.26130258,
14
+ 0.27577711
15
+ ],
16
+ "resample": 3,
17
+ "size": {
18
+ "height": 224,
19
+ "width": 224
20
+ }
21
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a17f44c0475d70d34c3842b30a3116bd2355b810aadffd4f55b2d0a450c3ebd6
3
+ size 377054982
vocab.txt ADDED
The diff for this file is too large to render. See raw diff