license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
Model Card for CAPIVARA
CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. This model achieves state-of-the-art results on many zero-shot tasks involving images and Portuguese text.
How to use
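# Assumes the code from the CAPIVARA GitHub repository is importable (it provides `models` and `utils`)
from models.open_clip_wrapper import OpenCLIPWrapper
from utils.capivara_utils import download_pretrained_from_hf
# Download the pre-trained checkpoint from the Hugging Face Hub
model_path = download_pretrained_from_hf(model_id="hiaac-nlp/CAPIVARA")
# Load the checkpoint into the wrapper and keep the wrapped model
model = OpenCLIPWrapper.load_from_checkpoint(model_path, strict=False).model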
For more details, refer to the GitHub repository.
Model Details
Model Description
CAPIVARA is built upon the pre-trained OpenCLIP ViT-B/32 XLM-Roberta Base model and fine-tuned on Conceptual Captions and synthetic captions generated by BLIP2. All captions were translated with Google Translate.
Uses
Direct Use
Zero-shot image classification, zero-shot image and text retrieval, etc.
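As an illustration of zero-shot image classification with a CLIP-style model, the sketch below scores one image embedding against a set of Portuguese text prompts using cosine similarity followed by a softmax. It is a minimal sketch, not the repository's evaluation code: the 3-dimensional embeddings are hand-made toy values, the helper names `l2_normalize` and `zero_shot_classify` are hypothetical, and the `temperature` of 100 is an assumed logit scale. In practice the embeddings would come from the model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_embedding, label_embeddings, temperature=100.0):
    """Return (best_label, probabilities) for one image against text prompt embeddings.

    `temperature` is an assumed logit scale, not a value taken from the model card.
    """
    image = l2_normalize(image_embedding)
    logits = {}
    for label, text_embedding in label_embeddings.items():
        text = l2_normalize(text_embedding)
        logits[label] = temperature * sum(a * b for a, b in zip(image, text))
    # Softmax over the labels; subtracting the max keeps exp() numerically stable.
    max_logit = max(logits.values())
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Toy embeddings (hypothetical); real ones come from the image and text encoders.
label_embeddings = {
    "uma foto de um gato": [0.9, 0.1, 0.0],
    "uma foto de um cachorro": [0.1, 0.9, 0.0],
    "uma foto de uma capivara": [0.0, 0.2, 0.9],
}
image_embedding = [0.05, 0.25, 0.8]

best_label, probs = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

The same similarity scores can be used for retrieval by ranking all images for a given text (or all texts for a given image) instead of taking a softmax over labels.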
Downstream Use
Image classification and other image task fine-tuning, linear probe image classification, image captioning, image generation guiding and conditioning, etc.
Ethical considerations
For ethical considerations, please refer to the Model Cards section in the paper.
Training Details
Training Data
The model was fine-tuned on Conceptual Captions and synthetic captions generated by BLIP2. All captions were translated with Google Translate.
Training Hyperparameters
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]
LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500
batch_size: 2816
max_steps: 5863 # 10 epochs
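To make the schedule settings above concrete, here is a minimal sketch of a cosine learning-rate schedule with warmup using the listed values. It rests on assumptions that the model card does not confirm: that `warmup_lr: 500` is a number of warmup steps, that the warmup is a linear ramp from the minimum to the maximum learning rate, and that each step consumes one full batch. The function `cosine_warmup_lr` is hypothetical and may differ from the `CosineWarmupLR` class used in training.

```python
import math

MIN_LR = 1e-7        # min_learning_rate
MAX_LR = 5e-7        # max_learning_rate
WARMUP_STEPS = 500   # assumption: "warmup_lr: 500" counts warmup steps
MAX_STEPS = 5863     # 10 epochs according to the model card
BATCH_SIZE = 2816

def cosine_warmup_lr(step):
    """Learning rate at a given step: linear warmup, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from MIN_LR up to MAX_LR.
        return MIN_LR + (MAX_LR - MIN_LR) * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Samples processed over the whole run, assuming one full batch per step.
samples_seen = MAX_STEPS * BATCH_SIZE
samples_per_epoch = samples_seen / 10

for step in (0, 250, WARMUP_STEPS, 3000, MAX_STEPS):
    print(step, cosine_warmup_lr(step))
print(samples_seen, samples_per_epoch)
```

Under these assumptions the learning rate peaks at step 500 and returns to the minimum at step 5863, and the run processes roughly 1.65 million samples per epoch.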
Evaluation
Testing Data, Factors & Metrics
Testing Data
For cross-modal retrieval, we used PraCegoVer, which consists of images originally annotated with Portuguese texts, and our Portuguese-translated versions of MS COCO and Flickr30k. We also translated the labels from ImageNet and the ELEVATER benchmark datasets for image classification.
Results
Zero-shot Cross-Modal Retrieval
We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO, with captions translated into Portuguese, and on PraCegoVer. We report the mean and standard deviation over 3 runs.
Models | Flickr30k text-to-image | Flickr30k image-to-text | MS COCO text-to-image | MS COCO image-to-text | PraCegoVer text-to-image | PraCegoVer image-to-text |
---|---|---|---|---|---|---|
OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 76.23 | 87.93 | 52.62 | 66.55 | 65.36 | 69.43 |
CAPIVARA | 79.56 ± 0.01 | 89.95 ± 0.04 | 56.27 ± 0.01 | 71.24 ± 0.01 | 66.40 ± 0.01 | 64.75 ± 0.01 |
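For a quick reading of the retrieval table, the following sketch computes the difference between the CAPIVARA means and the baseline for each column. The numbers are copied from the table; the dictionary layout and key names are only an illustrative choice.

```python
# Mean scores copied from the retrieval table (standard deviations omitted).
baseline = {
    ("Flickr30k", "text-to-image"): 76.23,
    ("Flickr30k", "image-to-text"): 87.93,
    ("MS COCO", "text-to-image"): 52.62,
    ("MS COCO", "image-to-text"): 66.55,
    ("PraCegoVer", "text-to-image"): 65.36,
    ("PraCegoVer", "image-to-text"): 69.43,
}
capivara = {
    ("Flickr30k", "text-to-image"): 79.56,
    ("Flickr30k", "image-to-text"): 89.95,
    ("MS COCO", "text-to-image"): 56.27,
    ("MS COCO", "image-to-text"): 71.24,
    ("PraCegoVer", "text-to-image"): 66.40,
    ("PraCegoVer", "image-to-text"): 64.75,
}

# Positive values mean CAPIVARA scores higher than the baseline.
deltas = {key: round(capivara[key] - baseline[key], 2) for key in baseline}

for (dataset, direction), delta in deltas.items():
    print(f"{dataset:<11} {direction:<14} {delta:+.2f}")
```

By these figures CAPIVARA improves on the baseline in five of the six columns, while PraCegoVer image-to-text is lower than the baseline.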
Zero-shot image classification
Models | Caltech-101 | CIFAR-10 | CIFAR-100 | Country-211 | DTD | EuroSAT | FER-2013 | FGVC-Aircraft | Food-101 | GTSRB | Hateful-Memes | KITTI-Distance | MNIST | Oxford Flowers-102 | Oxford-IIIT Pets | PatchCamelyon | Rendered-SST2 | RESISC-45 | Stanford-Cars | PASCAL VOC-2007 | Average | ImageNet-1k |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00 | 93.99 ± 0.00 | 68.44 ± 0.00 | 17.82 ± 0.00 | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00 | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00 | 28.41 ± 0.00 | 54.99 ± 0.00 | 50.88 ± 0.00 | 81.56 ± 0.00 | 50.96 ± 0.00 | 54.20 ± 0.00 | 58.51 ± 0.00 | 84.93 ± 0.00 | 82.09 ± 0.00 | 56.97 ± 0.00 | 45.84 ± 0.00 |
CAPIVARA | 82.97 ± 0.03 | 93.85 ± 0.00 | 69.37 ± 0.01 | 17.61 ± 0.00 | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01 | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00 | 33.94 ± 0.13 | 60.14 ± 0.04 | 49.93 ± 0.02 | 79.37 ± 0.00 | 51.71 ± 0.01 | 54.82 ± 0.03 | 59.71 ± 0.01 | 85.10 ± 0.02 | 82.29 ± 0.00 | 57.51 ± 0.02 | 46.06 ± 0.01 |
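The Average column can be checked directly from the table: it matches the unweighted mean of the 20 per-dataset scores, with ImageNet-1k reported separately and not included. The short sketch below recomputes it from the means copied from the table (standard deviations omitted); the variable names are illustrative.

```python
# Per-dataset means in table order, from Caltech-101 to PASCAL VOC-2007 (20 datasets).
baseline_scores = [
    84.53, 93.99, 68.44, 17.82, 41.17, 47.16, 48.65, 26.30, 65.06, 43.27,
    56.50, 28.41, 54.99, 50.88, 81.56, 50.96, 54.20, 58.51, 84.93, 82.09,
]
capivara_scores = [
    82.97, 93.85, 69.37, 17.61, 42.34, 47.77, 46.68, 25.49, 64.58, 46.34,
    56.17, 33.94, 60.14, 49.93, 79.37, 51.71, 54.82, 59.71, 85.10, 82.29,
]

def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

baseline_avg = round(mean(baseline_scores), 2)
capivara_avg = round(mean(capivara_scores), 2)

print(baseline_avg, capivara_avg)
```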
Environmental Impact
- GPU: 1 x Quadro RTX 8000 (48 GB)
- Hours used: 31 hours
- Compute Region: Brazil
- Carbon footprint: 0.5 kg
- Energy: 6.49 kW
Citation
@inproceedings{santos2023capivara,
title={CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
author={Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
booktitle = "Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2023"
}