---
license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- Portuguese
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
---
# Model Card for CAPIVARA
CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
This model achieves state-of-the-art results on many zero-shot tasks involving images and Portuguese texts.
## Model Details
### Model Description
CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) and
fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translate.
## Uses
### Direct Use
Zero-shot image classification, zero-shot image and text retrieval, etc.
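
As a rough illustration of what zero-shot classification with a CLIP-style model involves, the sketch below scores one image embedding against the text embeddings of a few Portuguese label prompts and picks the best match. It is not taken from the CAPIVARA code base: the 3-dimensional embeddings, the prompts, and the `logit_scale` value of 100 are all assumptions made for illustration, and in practice the embeddings would come from the model's image and text encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N label prompts.

    logit_scale=100.0 is an assumed temperature, not a value from this model card.
    """
    img = normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Subtract the maximum logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical prompts and toy embeddings standing in for real encoder outputs.
labels = ["uma foto de um gato", "uma foto de um cachorro", "uma foto de uma capivara"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.1, 0.2, 0.9]

probs = zero_shot_probs(image_emb, text_embs)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

Retrieval follows the same pattern: instead of taking the single best label, candidates are ranked by their similarity to the query embedding.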
### Downstream Use
Image classification and other image task fine-tuning, linear probe image classification,
image captioning, image generation guiding and conditioning, etc.
## Ethical considerations
For ethical considerations, please read the Model Cards section of the [paper](https://arxiv.org/abs/2310.13683).
## Training Details
### Training Data
The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translate.
#### Training Hyperparameters
```
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]
LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500
batch_size: 2816
max_steps: 5863 # 10 epochs
```
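
To make the learning-rate settings above more concrete, here is a minimal sketch of one common "cosine with warmup" schedule. The exact behavior of `CosineWarmupLR` is not described in this card, so the shape below is an assumption: a linear ramp from `min_learning_rate` to `max_learning_rate` over the first 500 steps (reading `warmup_lr: 500` as a step count), followed by a cosine decay back to `min_learning_rate` at `max_steps`.

```python
import math

MIN_LR = 1e-7        # min_learning_rate from the config above
MAX_LR = 5e-7        # max_learning_rate from the config above
WARMUP_STEPS = 500   # assumes `warmup_lr: 500` counts optimizer steps
MAX_STEPS = 5863     # max_steps from the config above


def lr_at(step):
    """Learning rate at a given step under the assumed warmup + cosine shape."""
    if step < WARMUP_STEPS:
        # Linear warmup from MIN_LR up to MAX_LR.
        return MIN_LR + (MAX_LR - MIN_LR) * step / WARMUP_STEPS
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps.
    progress = min((step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 3000, 5863):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions the rate peaks at step 500 and returns to the minimum at the final step; the actual training code may differ in the warmup start value or decay endpoint.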
## Evaluation
+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)
### Testing Data, Factors & Metrics
#### Testing Data
For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), which is composed of images originally annotated
with Portuguese texts, and our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48)
and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf).
For image classification, we also translated the labels from [ImageNet](https://ieeexplore.ieee.org/document/5206848) and the
[ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html)
benchmark datasets.
### Results
#### Zero-shot Cross-Modal Retrieval
We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO with captions
translated into Portuguese, and PraCegoVer. We report the average and standard deviation for 3 runs.
<table>
<thead>
<tr>
<th>Models</th>
<th colspan="2">Flickr30k</th>
<th colspan="2"> MS COCO</th>
<th colspan="2">PraCegoVer</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>text-to-image</td>
<td> image-to-text</td>
<td>text-to-image</td>
<td> image-to-text</td>
<td>text-to-image</td>
<td> image-to-text</td>
</tr>
<tr>
<td>OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)</td>
<td>76.23</td>
<td>87.93</td>
<td>52.62</td>
<td>66.55</td>
<td>65.36</td>
<td><b>69.43</b></td>
</tr>
<tr>
<td>CAPIVARA</td>
<td><b>79.56 ± 0.01</b></td>
<td><b>89.95 ± 0.04</b></td>
<td><b>56.27 ± 0.01</b></td>
<td><b>71.24 ± 0.01</b></td>
<td><b>66.40 ± 0.01</b></td>
<td>64.75 ± 0.01</td>
</tr>
</tbody>
</table>
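
The "mean ± standard deviation" entries in the table can be produced from per-run scores along the following lines. This is only a sketch: the three scores are made-up placeholders rather than real CAPIVARA results, and the use of the sample standard deviation is an assumption, since the card does not say which estimator was used.

```python
import statistics


def summarize(scores):
    """Format a list of per-run scores as 'mean ± std' with two decimals."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumed estimator)
    return f"{mean:.2f} ± {std:.2f}"


# Hypothetical scores from 3 runs; placeholders, not values from this card.
runs = [70.0, 70.5, 71.0]
summary = summarize(runs)
print(summary)
```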
#### Zero-shot Image Classification
| Models | **Caltech-101** | **CIFAR-10** | **CIFAR-100** | **Country-211** | **DTD** | **EuroSAT** | **FER-2013** | **FGVC-Aircraft** | **Food-101** | **GTSRB** | **Hateful-Memes** | **KITTI-Distance** | **MNIST** | **Oxford Flowers-102** | **Oxford-IIIT Pets** | **PatchCamelyon** | **Rendered-SST2** | **RESISC-45** | **Stanford-Cars** | **PASCAL VOC-2007** | **Average** | **ImageNet-1k** |
|:-----------------------:|:---------------:|:------------:|:-------------:|:---------------:|:------------:|:------------:|:------------:|:-----------------:|:------------:|:------------:|:-----------------:|:------------------:|:------------:|:----------------------:|:--------------------:|:-----------------:|:-----------------:|:-------------:|:-----------------:|:-------------------:|:------------:|:---------------:|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00 | 93.99 ± 0.00 | 68.44 ± 0.00 | 17.82 ± 0.00 | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00 | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00 | 28.41 ± 0.00 | 54.99 ± 0.00 | 50.88 ± 0.00 | 81.56 ± 0.00 | 50.96 ± 0.00 | 54.20 ± 0.00 | 58.51 ± 0.00 | 84.93 ± 0.00 | 82.09 ± 0.00 | 56.97 ± 0.00 | 45.84 ± 0.00 |
| CAPIVARA | 82.97 ± 0.03 | 93.85 ± 0.00 | 69.37 ± 0.01 | 17.61 ± 0.00 | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01 | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00 | 33.94 ± 0.13 | 60.14 ± 0.04 | 49.93 ± 0.02 | 79.37 ± 0.00 | 51.71 ± 0.01 | 54.82 ± 0.03 | 59.71 ± 0.01 | 85.10 ± 0.02 | 82.29 ± 0.00 | **57.51 ± 0.02** | **46.06 ± 0.01** |
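
The **Average** column appears to be the unweighted mean of the 20 dataset columns that precede it, with ImageNet-1k reported separately. The short check below recomputes it from the mean accuracies in the table; the variable names are illustrative only.

```python
# Mean accuracies copied from the table above, Caltech-101 through PASCAL VOC-2007.
baseline = [
    84.53, 93.99, 68.44, 17.82, 41.17, 47.16, 48.65, 26.30, 65.06, 43.27,
    56.50, 28.41, 54.99, 50.88, 81.56, 50.96, 54.20, 58.51, 84.93, 82.09,
]
capivara = [
    82.97, 93.85, 69.37, 17.61, 42.34, 47.77, 46.68, 25.49, 64.58, 46.34,
    56.17, 33.94, 60.14, 49.93, 79.37, 51.71, 54.82, 59.71, 85.10, 82.29,
]


def average(scores):
    """Unweighted mean over datasets."""
    return sum(scores) / len(scores)


baseline_avg = round(average(baseline), 2)
capivara_avg = round(average(capivara), 2)
print(baseline_avg, capivara_avg)

# Count the datasets on which CAPIVARA's mean accuracy exceeds the baseline's.
wins = sum(c > b for c, b in zip(capivara, baseline))
print(wins, "of", len(baseline))
```

The recomputed averages match the reported 56.97 and 57.51, which shows the gain is an aggregate one: per-dataset results go in both directions.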
## Environmental Impact
- **GPU:** 1 x Quadro RTX 8000 (48 GB)
- **Hours used:** 31 hours
- **Compute Region:** Brazil
- **Carbon footprint:** 0.5 kg
- **Energy:** 6.49 kWh
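
As a quick sanity check on these figures, the sketch below derives the implied average power draw and carbon intensity. It assumes the energy figure is meant as kilowatt-hours consumed over the whole 31-hour run; the derived numbers are back-of-the-envelope values, not measurements reported by the authors.

```python
hours = 31.0       # "Hours used" from the list above
energy_kwh = 6.49  # reported energy, assumed to be in kWh
carbon_kg = 0.5    # reported carbon footprint in kg

# Average power draw implied by the energy and duration.
avg_power_kw = energy_kwh / hours

# Carbon emitted per unit of energy implied by the two reported totals.
carbon_intensity = carbon_kg / energy_kwh

print(f"{avg_power_kw:.3f} kW average draw")
print(f"{carbon_intensity:.3f} kg per kWh")
```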
## Citation
```bibtex
@inproceedings{santos2023capivara,
title={CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
author={Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
booktitle = "Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2023"
}
```