	---
	license: mit
	language:
	- pt
	tags:
	- CAPIVARA
	- Portuguese CLIP
	- OpenCLIP
	datasets:
	- conceptual_captions
	- PraCegoVer
	- MS_COCO
	- Flickr30K
	- ImageNet
	- ELEVATER
	---
	# Model Card for CAPIVARA

	CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
	At the time of publication, this model achieved state-of-the-art results on many zero-shot tasks involving images and Portuguese texts.

	## How to use
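	```python
	# Run from the CAPIVARA repository source tree so that `utils` and `models` resolve.
	# NOTE: the import path for OpenCLIPWrapper is an assumption; check the repository.
	from models.open_clip_wrapper import OpenCLIPWrapper
	from utils.capivara_utils import download_pretrained_from_hf

	# Download the checkpoint from the Hugging Face Hub and load the underlying model.
	model_path = download_pretrained_from_hf(model_id="hiaac-nlp/CAPIVARA")
	model = OpenCLIPWrapper.load_from_checkpoint(model_path, strict=False).model
	```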

	For more details, refer to the [GitHub repository](https://github.com/hiaac-nlp/CAPIVARA/).


	## Model Details

	### Model Description

	CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) model and
	fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
	All captions are translated into Portuguese with Google Translate.


	## Uses

	### Direct Use

	Zero-shot image classification, zero-shot image and text retrieval, etc.
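The zero-shot classification use case can be sketched as follows. This is a minimal illustration and not the repository's evaluation code: the toy embeddings stand in for real image and text encoder outputs, and the helper names and `logit_scale=100.0` are assumptions introduced only for this example. The idea is to normalize both embeddings, score each Portuguese label by cosine similarity with the image, and take the best match.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Return (best_label, probabilities) for one image embedding.

    logit_scale=100.0 is an assumed temperature, not a value from this card.
    """
    image_emb = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(image_emb, l2_normalize(t)))
        for t in text_embs
    ]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy 3-dimensional embeddings standing in for real encoder outputs.
labels = ["capivara", "gato", "cachorro"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.2]

predicted, probs = zero_shot_classify(image_emb, text_embs, labels)
print(predicted)  # capivara
```

With the real model, the embeddings would come from the image and text encoders, and the label texts would typically be wrapped in a Portuguese prompt template before encoding.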

	### Downstream Use

	Image classification and other image task fine-tuning, linear probe image classification,
	image captioning, image generation guiding and conditioning, etc.


	## Ethical considerations

	For ethical considerations, please refer to the Model Cards section in the [paper](https://arxiv.org/abs/2310.13683).

	## Training Details

	### Training Data
	The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
	All captions are translated into Portuguese with Google Translate.

	#### Training Hyperparameters
	```
	Optimizer: "Adam"
	eps: 1e-8
	weight_decay: 0.2
	betas: [ 0.9, 0.98 ]

	LR_scheduler: "CosineWarmupLR"
	min_learning_rate: 1e-7
	max_learning_rate: 5e-7
	warmup_lr: 500

	batch_size: 2816
	max_steps: 5863 # 10 epochs
	```
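To make the scheduler settings above concrete, here is a minimal sketch of a warmup-plus-cosine schedule using the listed values. The exact behavior of `CosineWarmupLR` in the repository is not shown in this card, so the shape below is an assumption: `warmup_lr: 500` is read as 500 warmup steps, the warmup is a linear ramp from near zero to `max_learning_rate`, and the remainder is a cosine decay down to `min_learning_rate` at `max_steps`. The function name `cosine_warmup_lr` is hypothetical.

```python
import math


def cosine_warmup_lr(step, max_lr=5e-7, min_lr=1e-7, warmup_steps=500, max_steps=5863):
    """Learning rate at a 0-indexed training step (assumed schedule shape)."""
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from max_lr (progress=0) to min_lr (progress=1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 499, 500, 3000, 5863):
    print(step, f"{cosine_warmup_lr(step):.3e}")
```

Under these assumptions the learning rate peaks at 5e-7 when warmup ends and falls to 1e-7 by the final step.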

	## Evaluation

	+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
	+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)


	### Testing Data, Factors & Metrics

	#### Testing Data

	For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), which is composed of images originally annotated
	with Portuguese texts, and our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48)
	and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf).
	For image classification, we also translated the labels from [ImageNet](https://ieeexplore.ieee.org/document/5206848) and the
	[ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html)
	benchmark datasets into Portuguese.

	### Results

	#### Zero-shot Cross-Modal Retrieval

	We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO with captions
	translated into Portuguese, and PraCegoVer. We report the average and standard deviation for 3 runs.

	<table>
	<thead>
	<tr>
	<th>Models</th>
	<th colspan="2">Flickr30k</th>
	<th colspan="2"> MS COCO</th>
	<th colspan="2">PraCegoVer</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td></td>
	<td>text-to-image</td>
	<td> image-to-text</td>
	<td>text-to-image</td>
	<td> image-to-text</td>
	<td>text-to-image</td>
	<td> image-to-text</td>
	</tr>
	<tr>
	<td>OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)</td>
	<td>76.23</td>
	<td>87.93</td>
	<td>52.62</td>
	<td>66.55</td>
	<td>65.36</td>
	<td><b>69.43</b></td>
	</tr>
	<tr>
	<td>CAPIVARA</td>
	<td><b>79.56 ± 0.01</b></td>
	<td><b>89.95 ± 0.04</b></td>
	<td><b>56.27 ± 0.01</b></td>
	<td><b>71.24 ± 0.01</b></td>
	<td><b>66.40 ± 0.01</b></td>
	<td>64.75 ± 0.01</td>
	</tr>
	</tbody>
	</table>
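This card does not name the retrieval metric behind the table above. Recall@k is one common choice for this kind of evaluation, and the sketch below uses it purely as an illustrative assumption. It shows how text-to-image and image-to-text scores can both be read off one similarity matrix. The function names and the toy matrix are hypothetical, and the one-caption-per-image pairing is a simplification, since MS COCO and Flickr30k provide several captions per image.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching candidate (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap rows and columns, turning caption-by-image into image-by-caption."""
    return [list(col) for col in zip(*matrix)]


# Toy similarity matrix: 3 captions (rows) x 3 images (columns).
text_to_image = [
    [0.9, 0.1, 0.3],
    [0.4, 0.5, 0.8],
    [0.2, 0.1, 0.7],
]
image_to_text = transpose(text_to_image)

for k in (1, 2):
    t2i = recall_at_k(text_to_image, k)
    i2t = recall_at_k(image_to_text, k)
    print(f"R@{k}: text-to-image={t2i:.2f}, image-to-text={i2t:.2f}")
```

In a real evaluation the matrix would hold cosine similarities between caption and image embeddings, and the two directions can differ, as they do in the PraCegoVer columns of the table.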

	#### Zero-shot Image Classification

	| Models | Caltech-101 | CIFAR-10 | CIFAR-100 | Country-211 | DTD | EuroSAT | FER-2013 | FGVC-Aircraft | Food-101 | GTSRB | Hateful-Memes | KITTI-Distance | MNIST | Oxford Flowers-102 | Oxford-IIIT Pets | PatchCamelyon | Rendered-SST2 | RESISC-45 | Stanford-Cars | PASCAL VOC-2007 | Average | ImageNet-1k |
	|:-----------------------:|:---------------:|:------------:|:-------------:|:---------------:|:------------:|:------------:|:------------:|:-----------------:|:------------:|:------------:|:-----------------:|:------------------:|:------------:|:----------------------:|:--------------------:|:-----------------:|:-----------------:|:-------------:|:-----------------:|:-------------------:|:------------:|:---------------:|
	| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00 | 93.99 ± 0.00 | 68.44 ± 0.00 | 17.82 ± 0.00 | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00 | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00 | 28.41 ± 0.00 | 54.99 ± 0.00 | 50.88 ± 0.00 | 81.56 ± 0.00 | 50.96 ± 0.00 | 54.20 ± 0.00 | 58.51 ± 0.00 | 84.93 ± 0.00 | 82.09 ± 0.00 | 56.97 ± 0.00 | 45.84 ± 0.00 |
	| CAPIVARA | 82.97 ± 0.03 | 93.85 ± 0.00 | 69.37 ± 0.01 | 17.61 ± 0.00 | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01 | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00 | 33.94 ± 0.13 | 60.14 ± 0.04 | 49.93 ± 0.02 | 79.37 ± 0.00 | 51.71 ± 0.01 | 54.82 ± 0.03 | 59.71 ± 0.01 | 85.10 ± 0.02 | 82.29 ± 0.00 | 57.51 ± 0.02 | 46.06 ± 0.01 |

	## Environmental Impact

	- GPU: 1 x Quadro RTX 8000 (48 GB)
	- Hours used: 31 hours
	- Compute Region: Brazil
	- Carbon footprint: 0.5 kg
	- Energy: 6.49 kWh
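
As a quick sanity check on the figures above, the short calculation below derives the implied average power draw and carbon intensity. It assumes the energy figure is in kilowatt-hours, since energy accumulated over 31 hours is normally reported in that unit. The variable names are only for illustration.

```python
gpu_hours = 31.0    # "Hours used" from the list above
energy_kwh = 6.49   # energy figure, assumed to be in kWh
carbon_kg = 0.5     # reported carbon footprint

# Average power over the run, converted from kW to W.
avg_power_w = energy_kwh / gpu_hours * 1000
# Implied emissions per unit of energy.
carbon_intensity = carbon_kg / energy_kwh

print(f"Average power: {avg_power_w:.0f} W")                  # Average power: 209 W
print(f"Carbon intensity: {carbon_intensity:.3f} kg per kWh")  # Carbon intensity: 0.077 kg per kWh
```

An average draw of roughly 209 W is plausible for a single workstation GPU under sustained load, which supports reading the energy value as kWh.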


	## Citation

	```bibtex
	@inproceedings{santos2023capivara,
	  title = {CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
	  author = {Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
	  booktitle = {Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)},
	  year = {2023}
	}
	```