---
license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- Portuguese
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
---
# Model Card for CAPIVARA

CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
This model achieves state-of-the-art performance on many zero-shot tasks involving images and Portuguese text.

## Model Details

### Model Description

CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) and
fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translator.

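A minimal loading sketch with the `open_clip` library is shown below. The backbone identifier is the base model named above; the CAPIVARA checkpoint filename is an illustrative placeholder, not a confirmed file in this repository.

```python
import torch
import open_clip

# Backbone used by CAPIVARA: OpenCLIP ViT-B/32 with an XLM-Roberta Base text encoder.
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"
)
tokenizer = open_clip.get_tokenizer(
    "hf-hub:laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k"
)

# Hypothetical checkpoint name: replace with the CAPIVARA weights file from this repository.
state_dict = torch.load("capivara.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
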
## Uses

### Direct Use

Zero-shot image classification, zero-shot image and text retrieval, etc.

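As an illustration of the zero-shot classification use case, the sketch below scores an image against Portuguese prompts. It reuses `model`, `preprocess`, and `tokenizer` from the loading sketch above; the image path and class names are placeholders.

```python
from PIL import Image
import torch

# Portuguese class prompts (placeholder label set).
classes = ["um cachorro", "um gato", "uma capivara"]
text = tokenizer([f"uma foto de {c}" for c in classes])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings and turn cosine similarities into probabilities over the label set.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({c: round(p.item(), 3) for c, p in zip(classes, probs[0])})
```
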
### Downstream Use

Fine-tuning for image classification and other image tasks, linear-probe image classification,
image captioning, image generation guidance and conditioning, etc.


## Ethical Considerations

For ethical considerations, please read the Model Cards section in the [paper](https://arxiv.org/abs/2310.13683).

## Training Details

### Training Data
The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translator.

#### Training Hyperparameters
```yaml
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]

LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500

batch_size: 2816
max_steps: 5863 # 10 epochs
```

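For reference, the snippet below is a rough PyTorch sketch of how these values could be wired together; it is not the authors' training code. The custom `CosineWarmupLR` is approximated here with a `LambdaLR`, and `warmup_lr: 500` is read as 500 warmup steps (an assumption).

```python
import math
import torch

def build_optimizer_and_scheduler(model, max_lr=5e-7, min_lr=1e-7,
                                  warmup_steps=500, max_steps=5863):
    # Adam with the hyperparameters listed above.
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=max_lr,
        betas=(0.9, 0.98),
        eps=1e-8,
        weight_decay=0.2,
    )

    def lr_lambda(step):
        # Linear warmup to max_lr, then cosine decay towards min_lr.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (max_lr - min_lr) * cosine) / max_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```
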
## Evaluation

+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)

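For reference, a bare-bones sketch of the recall@k metric underlying the cross-modal retrieval numbers below, assuming pre-computed, L2-normalized embeddings with one matching caption per image; the linked notebooks remain the authoritative evaluation code.

```python
import torch

def recall_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor, k: int = 1) -> float:
    """Text-to-image recall@k with index-aligned pairs (caption i matches image i)."""
    sims = text_emb @ image_emb.T                # cosine similarities (inputs are L2-normalized)
    topk = sims.topk(k, dim=-1).indices          # indices of the k most similar images per caption
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```
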
### Testing Data, Factors & Metrics

#### Testing Data

For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), which consists of images originally annotated
with Portuguese texts, along with our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48)
and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf).
We also translated into Portuguese the labels of [ImageNet](https://ieeexplore.ieee.org/document/5206848) and the
[ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html)
benchmark datasets for image classification.

### Results

#### Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO, with captions
translated into Portuguese, and on PraCegoVer. We report the mean and standard deviation over 3 runs.

<table>
<thead>
  <tr>
    <th>Models</th>
    <th colspan="2">Flickr30k</th>
    <th colspan="2">MS COCO</th>
    <th colspan="2">PraCegoVer</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td></td>
    <td>text-to-image</td>
    <td>image-to-text</td>
    <td>text-to-image</td>
    <td>image-to-text</td>
    <td>text-to-image</td>
    <td>image-to-text</td>
  </tr>
  <tr>
    <td>OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)</td>
    <td>76.23</td>
    <td>87.93</td>
    <td>52.62</td>
    <td>66.55</td>
    <td>65.36</td>
    <td><b>69.43</b></td>
  </tr>
  <tr>
    <td>CAPIVARA</td>
    <td><b>79.56 ± 0.01</b></td>
    <td><b>89.95 ± 0.04</b></td>
    <td><b>56.27 ± 0.01</b></td>
    <td><b>71.24 ± 0.01</b></td>
    <td><b>66.40 ± 0.01</b></td>
    <td>64.75 ± 0.01</td>
  </tr>
</tbody>
</table>

#### Zero-shot Image Classification

| Models | **Caltech-101** | **CIFAR-10** | **CIFAR-100** | **Country-211** | **DTD** | **EuroSAT** | **FER-2013** | **FGVC-Aircraft** | **Food-101** | **GTSRB** | **Hateful-Memes** | **KITTI-Distance** | **MNIST** | **Oxford Flowers-102** | **Oxford-IIIT Pets** | **PatchCamelyon** | **Rendered-SST2** | **RESISC-45** | **Stanford-Cars** | **PASCAL VOC-2007** | **Average** | **ImageNet-1k** |
|:-----------------------:|:---------------:|:------------:|:-------------:|:---------------:|:------------:|:------------:|:------------:|:-----------------:|:------------:|:------------:|:-----------------:|:------------------:|:------------:|:----------------------:|:--------------------:|:-----------------:|:-----------------:|:-------------:|:-----------------:|:-------------------:|:------------:|:---------------:|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00 | 93.99 ± 0.00 | 68.44 ± 0.00 | 17.82 ± 0.00 | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00 | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00 | 28.41 ± 0.00 | 54.99 ± 0.00 | 50.88 ± 0.00 | 81.56 ± 0.00 | 50.96 ± 0.00 | 54.20 ± 0.00 | 58.51 ± 0.00 | 84.93 ± 0.00 | 82.09 ± 0.00 | 56.97 ± 0.00 | 45.84 ± 0.00 |
| CAPIVARA | 82.97 ± 0.03 | 93.85 ± 0.00 | 69.37 ± 0.01 | 17.61 ± 0.00 | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01 | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00 | 33.94 ± 0.13 | 60.14 ± 0.04 | 49.93 ± 0.02 | 79.37 ± 0.00 | 51.71 ± 0.01 | 54.82 ± 0.03 | 59.71 ± 0.01 | 85.10 ± 0.02 | 82.29 ± 0.00 | **57.51 ± 0.02** | **46.06 ± 0.01** |

## Environmental Impact

- **GPU:** 1 x Quadro RTX 8000 (48 GB)
- **Hours used:** 31
- **Compute Region:** Brazil
- **Carbon footprint:** 0.5 kg
- **Energy consumed:** 6.49 kWh


## Citation

```bibtex
@inproceedings{santos2023capivara,
  title     = {CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author    = {Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle = {Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2023}
}
```