Update README.md
README.md
@@ -1,3 +1,125 @@
---
license: mit
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
  example_title: Cat & Dog
---
# Model Card for CLIP ViT-H/14 frozen xlm roberta large - LAION-5B

# Table of Contents

1. [Model Details](#model-details)
2. [Uses](#uses)
3. [Training Details](#training-details)
4. [Evaluation](#evaluation)
5. [Acknowledgements](#acknowledgements)
6. [Citation](#citation)
7. [How To Get Started With the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

A CLIP model with a frozen ViT-H/14 image encoder and an XLM-RoBERTa large text encoder, trained on LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).

Model training was done by Romain Beaumont on the [stability.ai](https://stability.ai/) cluster.

# Uses

## Direct Use

Zero-shot image classification, image and text retrieval, among others.

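As a rough illustration of how zero-shot classification with a CLIP-style model typically scores candidate labels, the sketch below compares one image embedding against one text embedding per label using cosine similarity followed by a softmax. The embeddings are toy values rather than outputs of this model, and the `logit_scale` of 100 is an assumption based on common CLIP practice, not a value stated in this card.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Return one probability per label for a single image embedding.

    logit_scale=100.0 is an assumed temperature, not taken from this model card.
    """
    image = l2_normalize(image_emb)
    logits = []
    for emb in label_embs:
        text = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    # Numerically stable softmax over the label logits.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical values, for illustration only).
labels = ["playing music", "playing sports"]
toy_image_emb = [0.9, 0.1, 0.2]
toy_label_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]

probs = zero_shot_probs(toy_image_emb, toy_label_embs)
best_label = labels[probs.index(max(probs))]
```

With real embeddings, `image_emb` would come from the image encoder and each entry of `label_embs` from the text encoder applied to a label prompt.
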
## Downstream Use

Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

# Training Details

## Training Data

This model was trained on the full LAION-5B (https://laion.ai/blog/laion-5b/).

## Training Procedure

Training used a batch size of 90k for 13B samples of LAION-5B; see https://wandb.ai/rom1504/open-clip/reports/xlm-roberta-large-unfrozen-vit-h-14-frozen--VmlldzoyOTc3ODY3

The model uses ViT-H/14 on the visual side and XLM-RoBERTa large, initialized with pretrained weights, on the text side.

The H/14 was initialized from https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K and kept frozen during training.

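For a sense of scale, the batch size and sample count above imply a rough number of optimizer steps. The arithmetic below is a back-of-the-envelope sketch that assumes "90k" means exactly 90,000 and "13B" exactly 13,000,000,000; the actual run may have differed slightly.

```python
# Figures taken from the text, assumed to be exact round numbers.
samples_seen = 13_000_000_000   # "13B" samples
global_batch_size = 90_000      # batch size "90k"

# Each optimizer step consumes one global batch.
steps = samples_seen / global_batch_size
approx_steps = round(steps)
```

Under these assumptions the run corresponds to roughly 144 thousand steps.
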
# Evaluation

Evaluation was done with the code in the [LAION CLIP Benchmark suite](https://github.com/LAION-AI/CLIP_benchmark).

## Testing Data, Factors & Metrics

### Testing Data

Testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and with COCO and Flickr for retrieval.

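This card does not state which retrieval metric is reported. Recall@k is a common choice for COCO and Flickr retrieval, so the sketch below shows how it could be computed from a query-by-candidate similarity matrix. The function name and the toy scores are hypothetical and are not taken from the benchmark suite.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct candidate is among the top-k scores.

    similarity[i][j] is the score of query i against candidate j;
    ground_truth[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Toy similarity matrix: 3 queries (rows) against 3 candidates (columns).
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
toy_truth = [0, 1, 2]

r1 = recall_at_k(toy_similarity, toy_truth, k=1)
r2 = recall_at_k(toy_similarity, toy_truth, k=2)
```
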
## Results

The model achieves 77.0% on ImageNet-1k (vs. 78% for the English H/14).
![results_xlm_roberta_large.png](results_xlm_roberta_large.png)

On zero-shot classification on ImageNet with translated prompts, this model reaches:
* 56% in Italian (vs 21% for https://github.com/clip-italian/clip-italian)
* 53% in Japanese (vs 54.6% for https://github.com/rinnakk/japanese-clip)
* 55.7% in Chinese (to be compared with https://github.com/OFA-Sys/Chinese-CLIP)

This model reaches strong results in both English and other languages.

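To make "translated prompts" concrete, the sketch below builds one text prompt per class in a given language from a template. The templates and class-name translations are hypothetical placeholders; the actual prompts used for the numbers above are not given in this card.

```python
# Hypothetical templates and class names, for illustration only.
PROMPT_TEMPLATES = {
    "en": "a photo of a {}",
    "it": "una foto di un {}",
    "ja": "{}の写真",
}
CLASS_NAMES = {
    "en": ["cat", "dog"],
    "it": ["gatto", "cane"],
    "ja": ["猫", "犬"],
}


def build_prompts(language):
    """Fill the language's template with each translated class name."""
    template = PROMPT_TEMPLATES[language]
    return [template.format(name) for name in CLASS_NAMES[language]]


italian_prompts = build_prompts("it")
japanese_prompts = build_prompts("ja")
```

Each prompt would then be encoded by the multilingual text encoder and compared against image embeddings, exactly as in English zero-shot classification.
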
# Acknowledgements

Acknowledging [stability.ai](https://stability.ai/) for the compute used to train this model.

# Citation

**BibTeX:**

In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:

OpenAI CLIP paper
```
@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}
```

OpenCLIP software
```
@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}
```

# How To Get Started With the Model

https://github.com/mlfoundations/open_clip

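A minimal sketch of loading the model through OpenCLIP and running zero-shot classification might look like the following. The model name `xlm-roberta-large-ViT-H-14` and the pretrained tag `frozen_laion5b_s13b_b90k` are assumptions about how this checkpoint is registered in `open_clip` and are not stated in this card; check `open_clip.list_pretrained()` in your installed version. The heavy imports are done lazily so the helpers can be defined without downloading any weights.

```python
# Assumed identifiers; verify against open_clip.list_pretrained().
MODEL_NAME = "xlm-roberta-large-ViT-H-14"
PRETRAINED_TAG = "frozen_laion5b_s13b_b90k"


def load_model():
    """Load the model, its image preprocessing transform, and its tokenizer.

    Calling this downloads the checkpoint, which is several gigabytes.
    """
    import open_clip  # imported lazily; requires `pip install open_clip_torch`

    model, _, preprocess = open_clip.create_model_and_transforms(
        MODEL_NAME, pretrained=PRETRAINED_TAG
    )
    tokenizer = open_clip.get_tokenizer(MODEL_NAME)
    model.eval()
    return model, preprocess, tokenizer


def classify(image_path, labels):
    """Return a {label: probability} mapping for one image."""
    import torch
    from PIL import Image

    model, preprocess, tokenizer = load_model()
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer(labels)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize so the dot product is a cosine similarity.
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    return dict(zip(labels, probs[0].tolist()))
```

For example, with a local copy of the widget image saved under a path of your choosing, `classify("cat-dog-music.png", ["playing music", "playing sports"])` would return a probability for each label.
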