---
license: apache-2.0
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: 音乐表演, 体育运动
  example_title: 猫和狗
---
[**中文说明**](README_CN.md) | [**English**](README.md)
# Introduction
This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs paired with related Chinese text descriptions, 400 million pairs in total. After screening, we used 100 million of them for training.
This project is produced by QQ-ARC Joint Lab, Tencent PCG. For more detailed information, please refer to the [main page of the QA-CLIP project](https://huggingface.co/TencentARC/QA-CLIP). We have also open-sourced our code on GitHub at [QA-CLIP](https://github.com/TencentARC-QQ/QA-CLIP); stars are welcome!
<br><br>

## Results
We ran zero-shot image-text retrieval tests on the [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn) datasets. For zero-shot image classification, we tested on the ImageNet dataset. The results are shown in the tables below:

**Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
    </tr>  
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
    </tr> 
	<tr align="center">
        <td width="120%">AltClip<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td><b>94.8</b></td><td>84.8</td><td>97.7</td><td>99.1</td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td>94.7</td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
    </tr>
</table>
<br>

**MUGE Zero-shot Retrieval (Official Validation Set)**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
    </tr>  
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
    </tr> 
	<tr align="center">
        <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
    </tr>
</table>
<br>

**COCO-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
    </tr>
    <tr align="center">
        <td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
    </tr>  
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td><b>94.7</b></td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
    </tr> 
	<tr align="center">
        <td width="120%">AltClip<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
    </tr>
</table>
<br>
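
For readers unfamiliar with the R@K columns in the retrieval tables above, the following is a minimal sketch of how Recall@K can be computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query (candidate `i` matches query `i`); the official evaluation scripts may differ, for example when an image has several captions. The function name `recall_at_k` and the toy scores are hypothetical and are not taken from the QA-CLIP code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: candidate i is the only correct match for query i.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 score matrix (made-up numbers): rows are text queries, columns are images.
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct image is ranked 1st
    [0.6, 0.5, 0.2],  # query 1: correct image is ranked 2nd
    [0.4, 0.8, 0.1],  # query 2: correct image is ranked 3rd
]

r1 = recall_at_k(toy_scores, 1)
r2 = recall_at_k(toy_scores, 2)
r3 = recall_at_k(toy_scores, 3)
print(f"R@1={r1:.1f}  R@2={r2:.1f}  R@3={r3:.1f}")
```

Text-to-image retrieval uses text queries against image candidates; image-to-text retrieval applies the same computation to the transposed matrix.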

**Zero-shot Image Classification on ImageNet**:
<table border="1" width="120%">
	<tr align="center">
        <th>Task</th><th colspan="1">ImageNet</th>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>RN50</sub></td><td>33.5</td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
    </tr>
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
    </tr>  
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
    </tr> 
	<tr align="center">
        <td width="120%">CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
    </tr>
	<tr align="center" style="background-color: Honeydew;">
        <td width="120%">QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
    </tr>
</table>
<br>

<br><br>


# Getting Started
## Inference Code
Inference code example:
```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-B-16")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
```
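
The example above normalizes the image and text features and then reads `logits_per_image` from the model. In CLIP-style models these logits are generally the cosine similarities between the normalized features, multiplied by a learned temperature. The sketch below illustrates that relationship in plain Python with made-up 2-dimensional vectors. The helper names, the toy embeddings, and the `LOGIT_SCALE` value are all assumptions for illustration; the real model stores its own learned scale.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length, like the `.norm(p=2, ...)` division above."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(logits: List[float]) -> List[float]:
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-d embeddings; real features have hundreds of dimensions.
image_feature = l2_normalize([3.0, 4.0])
text_features = [l2_normalize(v) for v in ([3.0, 4.0], [4.0, 3.0], [-4.0, 3.0])]

# Assumed temperature for illustration only; the trained model has its own value.
LOGIT_SCALE = 100.0

# For unit vectors, the dot product equals the cosine similarity.
cosines = [sum(a * b for a, b in zip(image_feature, t)) for t in text_features]
logits = [LOGIT_SCALE * c for c in cosines]
probs = softmax(logits)

# Index of the candidate text with the highest probability.
best = max(range(len(probs)), key=lambda i: probs[i])
print(cosines, probs, best)
```

Because of the large scale factor, small gaps in cosine similarity become large gaps in probability, which is why the softmax output is often sharply peaked on one label.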
<br><br>

# Acknowledgments
The project code is based on the implementation of <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>, and we are very grateful for their outstanding open-source contributions.
<br><br>