---
license: apache-2.0
---

# VeCLIP: Improving CLIP Training via Visual-enriched Captions

*A novel CLIP training scheme that achieves state-of-the-art (SoTA) performance on zero-shot ImageNet classification and COCO image-text retrieval using a limited number of visual-enriched captions.* [[Paper](https://arxiv.org/abs/2310.07699)]

[Zhengfeng Lai*](https://zjujefflai.github.io/), [Haotian Zhang*](https://haotian-zhang.github.io/), [Bowen Zhang](https://zbwglory.github.io/), Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, [Zhe Gan](https://zhegan27.github.io/), Jiulong Shan, [Chen-Nee Chuah](https://www.ece.ucdavis.edu/~chuah/rubinet/people/chuah/bio.html), Yinfei Yang, Meng Cao [*: equal contribution]

<p align="center">
<img src="figs/veclip_diagram.jpg" width="100%"><br>
Diagram of VeCap.
</p>

## Release
- [03/06/2024] 🔥 We released the VeCLIP & VeCap-DFN [checkpoints](#checkpoints).

## Contents
- [Install](#install)
- [Getting Started](#getting-started)
- [Checkpoints](#checkpoints)

## Install

1. Clone this repository
```Shell
git clone https://github.com/apple/ml-veclip
cd ml-veclip
```

2. Create an environment and install the required packages
```Shell
conda create -n veclip python=3.9 -y
conda activate veclip
pip install -r requirements.txt
```

## Getting Started

See the [example notebook](https://github.com/apple/ml-veclip/blob/main/load_veclip.ipynb) for details on how to load the different checkpoints using Hugging Face Transformers.
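
For a quick sanity check outside the notebook, a minimal loading sketch might look like the following. It assumes the unzipped checkpoint directory (here the hypothetical path `veclip_b16_200m/`, obtained by unzipping one of the archives listed under [Checkpoints](#checkpoints)) is readable by the `CLIPModel`/`CLIPProcessor` classes from `transformers`; the notebook remains the authoritative reference.

```python
# Minimal loading sketch (see load_veclip.ipynb for the authoritative steps).
# Assumption: "veclip_b16_200m" is the directory obtained by unzipping one of the
# released checkpoint archives and is compatible with transformers' CLIP classes.
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("veclip_b16_200m")
processor = CLIPProcessor.from_pretrained("veclip_b16_200m")
model.eval()
```
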
## Checkpoints

We release the checkpoints for **VeCLIP**, which are trained from scratch on the visual-enriched captions VeCap 3M/12M/100M/200M, as reported in the paper. The models are evaluated on COCO/Flickr30k image-text retrieval and ImageNet/ImageNetV2 classification in a zero-shot fashion. Use `wget` or `curl` to download the checkpoints below.

<table>
<thead>
<tr>
<th rowspan="2">Data</th>
<th rowspan="2">Model</th>
<th rowspan="2">Resolution</th>
<th colspan="2">COCO (R@1)</th>
<th colspan="2">Flickr30k (R@1)</th>
<th rowspan="2">ImageNet</th>
<th rowspan="2">ImageNetV2</th>
</tr>
<tr>
<th>I2T</th>
<th>T2I</th>
<th>I2T</th>
<th>T2I</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">VeCap 3M</td>
<td>CLIP-B/16</td>
<td>224x224</td>
<td>5.46</td>
<td>3.28</td>
<td>12.20</td>
<td>6.36</td>
<td>5.46</td>
<td>7.09</td>
</tr>
<tr>
<td><a href="https://docs-assets.developer.apple.com/ml-research/models/veclip/veclip_b16_3m.zip">VeCLIP-B/16</a></td>
<td>224x224</td>
<td>22.30</td>
<td>13.01</td>
<td>40.60</td>
<td>27.58</td>
<td>15.98</td>
<td>13.51</td>
</tr>
<tr>
<td rowspan="2">VeCap 12M</td>
<td>CLIP-B/16</td>
<td>224x224</td>
<td>24.52</td>
<td>14.28</td>
<td>44.70</td>
<td>29.06</td>
<td>31.60</td>
<td>27.03</td>
</tr>
<tr>
<td><a href="https://docs-assets.developer.apple.com/ml-research/models/veclip/veclip_b16_12m.zip">VeCLIP-B/16</a></td>
<td>224x224</td>
<td>47.78</td>
<td>31.62</td>
<td>73.90</td>
<td>55.68</td>
<td>38.11</td>
<td>32.53</td>
</tr>
<tr>
<td rowspan="2">VeCap 100M</td>
<td>CLIP-B/16</td>
<td>224x224</td>
<td>47.24</td>
<td>30.61</td>
<td>74.40</td>
<td>57.16</td>
<td>58.64</td>
<td>50.96</td>
</tr>
<tr>
<td><a href="https://docs-assets.developer.apple.com/ml-research/models/veclip/veclip_b16_100m.zip">VeCLIP-B/16</a></td>
<td>224x224</td>
<td>64.82</td>
<td>46.12</td>
<td>89.30</td>
<td>73.10</td>
<td>60.77</td>
<td>54.17</td>
</tr>
<tr>
<td rowspan="2">VeCap 200M</td>
<td>CLIP-B/16</td>
<td>224x224</td>
<td>52.20</td>
<td>34.97</td>
<td>80.90</td>
<td>63.26</td>
<td>63.72</td>
<td>56.84</td>
</tr>
<tr>
<td><a href="https://docs-assets.developer.apple.com/ml-research/models/veclip/veclip_b16_200m.zip">VeCLIP-B/16</a></td>
<td>224x224</td>
<td>67.20</td>
<td>48.40</td>
<td>91.10</td>
<td>76.32</td>
<td>64.64</td>
<td>57.67</td>
</tr>
</tbody>
</table>
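
If you would rather script the download than call `wget`/`curl` by hand, a small Python sketch using only the standard library (shown here with the VeCap 200M VeCLIP-B/16 URL from the table above) could look like this:

```python
# Download and unpack one released checkpoint (URL copied from the table above).
import urllib.request
import zipfile

url = "https://docs-assets.developer.apple.com/ml-research/models/veclip/veclip_b16_200m.zip"
archive = "veclip_b16_200m.zip"

urllib.request.urlretrieve(url, archive)   # fetch the zip archive
with zipfile.ZipFile(archive) as zf:
    zf.extractall("veclip_b16_200m")       # extract into a local directory
```
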
We further find that VeCap is complementary to other well-established filtering methods, e.g., the [Data Filtering Network (DFN)](https://arxiv.org/abs/2309.17425). We also provide those checkpoints (referred to as **VeCap-DFN**) and report their performance below.

<table>
<thead>
<tr>
<th rowspan="2">Backbone</th>
<th rowspan="2">Resolution</th>
<th rowspan="2">Data</th>
<th colspan="2">COCO (R@1)</th>
<th colspan="2">Flickr30k (R@1)</th>
<th rowspan="2">ImageNet</th>
<th rowspan="2">ImageNetV2</th>
</tr>
<tr>
<th>I2T</th>
<th>T2I</th>
<th>I2T</th>
<th>T2I</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3"><a href="https://docs-assets.developer.apple.com/ml-research/models/veclip/vecapdfn_clip_b16.zip">VeCap-DFN-B/16</a></td>
<td rowspan="3">224x224</td>
<td>DFN</td>
<td>62.96</td>
<td>43.20</td>
<td>87.10</td>
<td>70.44</td>
<td>76.15</td>
<td>68.19</td>
</tr>
<tr>
<td>VeCap 300M</td>
<td>64.74</td>
<td>44.58</td>
<td>90.10</td>
<td>73.14</td>
<td>46.43</td>
<td>41.15</td>
</tr>
<tr>
<td>DFN + VeCap 300M</td>
<td>66.28</td>
<td>45.12</td>
<td>88.80</td>
<td>73.56</td>
<td>76.19</td>
<td>69.58</td>
</tr>
<tr>
<td><a href="https://docs-assets.developer.apple.com/ml-research/models/veclip/vecapdfn_clip_l14.zip">VeCap-DFN-L/14</a></td>
<td>224x224</td>
<td>DFN + VeCap 300M</td>
<td>71.06</td>
<td>51.13</td>
<td>93.10</td>
<td>80.96</td>
<td>81.95</td>
<td>75.48</td>
</tr>
<tr>
<td><a href="https://docs-assets.developer.apple.com/ml-research/models/veclip/vecapdfn_clip_h14_336.zip">VeCap-DFN-H/14</a></td>
<td>336x336</td>
<td>DFN + VeCap 300M</td>
<td>72.78</td>
<td>52.33</td>
<td>93.60</td>
<td>82.64</td>
<td>83.07</td>
<td>76.37</td>
</tr>
</tbody>
</table>
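
As a rough illustration of how zero-shot numbers like those above are obtained, the sketch below scores one image against a handful of text prompts with a loaded checkpoint. The checkpoint path, image path, and class names are placeholders, and it again assumes the unzipped checkpoint is compatible with transformers' CLIP classes; the actual evaluation (prompt templates, datasets, retrieval protocol) follows the paper rather than this snippet.

```python
# Zero-shot classification sketch: rank a few candidate captions for one image.
# The checkpoint directory, image path, and class names below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("veclip_b16_200m").eval()
processor = CLIPProcessor.from_pretrained("veclip_b16_200m")

image = Image.open("example.jpg")
texts = [f"a photo of a {c}" for c in ["dog", "cat", "car"]]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image contains image-to-text similarity scores; softmax turns them
# into a probability distribution over the candidate prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```
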
## Citation

If you find VeCLIP useful, please cite using this BibTeX:

```bibtex
@article{lai2023scarcity,
  title={From scarcity to efficiency: Improving {CLIP} training via visual-enriched captions},
  author={Lai, Zhengfeng and Zhang, Haotian and Zhang, Bowen and Wu, Wentao and Bai, Haoping and Timofeev, Aleksei and Du, Xianzhi and Gan, Zhe and Shan, Jiulong and Chuah, Chen-Nee and Yang, Yinfei and others},
  journal={arXiv preprint arXiv:2310.07699},
  year={2023}
}

@article{fang2023data,
  title={Data filtering networks},
  author={Fang, Alex and Jose, Albin Madappally and Jain, Amit and Schmidt, Ludwig and Toshev, Alexander and Shankar, Vaishaal},
  journal={arXiv preprint arXiv:2309.17425},
  year={2023}
}
```

## Acknowledgement

- [axlearn](https://github.com/apple/axlearn): the codebase we use to train the models.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/en/index): provides the APIs we use to load our trained models.