
CLIP-Vision-mBART50 Seq2Seq Encoder-Decoder Model

CLIP-Vision-mBART50 is pre-trained on a subset of translated Conceptual-12M image-text pairs using a seq2seq training objective. 2.5M cleaned English image-text pairs were translated with Marian models into French, German, and Spanish, giving 2.5M examples in each of the four languages. We trained CLIP-Vision-mBART50 during the community week hosted by Huggingface 🤗 using JAX/Flax.

Model description

CLIP-Vision-mBART50 is a modified transformers model that takes visual embeddings from a CLIP-Vision transformer and feeds them in as the encoder_hidden_states of an mBART50 decoder. This enables deep cross-modal interaction via cross-attention between the two modalities. The decoder then predicts logits for the provided input_ids and can be used for generation.
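The wiring above can be sketched with plain NumPy shapes. This is a shape-level illustration only, assuming the standard dimensions of the two backbones: CLIP ViT-B/32 emits 50 tokens (49 patches + CLS) of size 768 per image, and mBART-large-50 uses a hidden size of 1024; the projection matrix here is a stand-in for the model's learned projection layer.

```python
import numpy as np

# Assumed dimensions: CLIP ViT-B/32 -> 50 visual tokens of size 768;
# mBART-large-50 decoder hidden size -> 1024.
batch, n_patches, clip_dim, mbart_dim = 2, 50, 768, 1024

visual_embeds = np.random.randn(batch, n_patches, clip_dim)

# A learned projection maps CLIP features into the decoder's hidden space.
# The projected sequence is passed as `encoder_hidden_states`, so every
# decoder layer cross-attends over the 50 visual tokens.
proj = np.random.randn(clip_dim, mbart_dim) * 0.02
encoder_hidden_states = visual_embeds @ proj

print(encoder_hidden_states.shape)  # (2, 50, 1024)
```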

Intended uses & limitations❗️

You can use the raw model as an encoder-decoder network where the encoder encodes images and the decoder decodes text.

Note that this model is primarily aimed at being fine-tuned on tasks like multi-lingual/mono-lingual image captioning.

How to use❓

You will need to clone the model from here. An example of usage is shown below:

from torchvision.io import read_image
import wget
from transformers import CLIPProcessor, MBart50TokenizerFast
from model.flax_clip_vision_mbart.modeling_clip_vision_mbart import FlaxCLIPVisionMBartForConditionalGeneration

img_path = wget.download("http://images.cocodataset.org/val2017/000000397133.jpg")
img = read_image(img_path)  # read the image as a channel-first tensor

clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
clip_outputs = clip_processor(images=img)
clip_outputs['pixel_values'][0] = clip_outputs['pixel_values'][0].transpose(1, 2, 0)  # transpose: the model expects channel-last images

tokenizer = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
model = FlaxCLIPVisionMBartForConditionalGeneration.from_pretrained('flax-community/clip-vit-base-patch32_mbart-large-50')

output_ids = model.generate(clip_outputs['pixel_values'], forced_bos_token_id=tokenizer.lang_code_to_id["es_XX"], num_beams=4, max_length=64).sequences  # "es_XX" is the language code for the desired output
# en_XX: English, fr_XX: French, es_XX: Spanish, de_DE: German
output_string = tokenizer.batch_decode(output_ids.reshape(-1, 64), skip_special_tokens=True, max_length=64)
output_string  # Un restaurante u otro lugar para comer en el Hotel
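When switching output languages, a small lookup keeps the mBART-50 codes straight. The helper below is hypothetical (not part of the repository); the authoritative mapping lives in `MBart50TokenizerFast.lang_code_to_id`, but the four codes listed here are the ones used in this model's training languages.

```python
# mBART-50 language codes for the four training languages
# (hypothetical convenience helper; codes match the comments above).
LANG_CODES = {
    "english": "en_XX",
    "french": "fr_XX",
    "spanish": "es_XX",
    "german": "de_DE",
}

def bos_code(language: str) -> str:
    """Return the mBART-50 language code for a human-readable language name."""
    return LANG_CODES[language.lower()]
```

The returned code would then be passed through `tokenizer.lang_code_to_id[...]` as `forced_bos_token_id` when calling `generate`.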

Training data 🏋🏻‍♂️

The multi-lingual image captioning model was trained on a subset of the Conceptual 12M dataset by Google:

Conceptual 12M, Introduced by Changpinyo et al. in Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts.

The translated dataset can be downloaded from conceptual-12m-multilingual-marian. We do not provide the images, as we do not own any of them. One can download the images from the image_url field of the original Conceptual 12M dataset.

Data Cleaning 🧹

Though the original dataset contains 12M image-text pairs, a lot of the URLs are invalid now, and in some cases, images are corrupt or broken. We remove such examples from our data, which leaves us with approximately 10M image-text pairs.
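A rough liveness check like the one below captures the spirit of this filtering. This is a minimal stdlib-only sketch, not the cleaning script actually used: it only verifies that a URL resolves and reports an image content type, whereas a full pass would also decode each file (e.g. with PIL) to catch corrupt or truncated images.

```python
from urllib.request import urlopen
from urllib.error import URLError

def looks_valid(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL resolves and reports an image content type.

    A sketch of the filtering step; the real cleaning would additionally
    decode each file to detect corrupt or broken images.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.headers.get_content_maintype() == "image"
    except (URLError, ValueError, OSError):
        return False
```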

Train set:

Total data: 10010625 captions, 2502656 images

Language-wise captions distribution:
English: 2502656
Spanish: 2502656
Deutsch: 2502656
French: 2502656

Validation set:

Total data: 110592 captions, 27648 images

Language-wise captions distribution:
English: 27648
Spanish: 27648
Deutsch: 27648
French: 27648

Training procedure 👨🏻‍💻

Training

The model was trained on a Google Cloud Engine TPU v3-8 machine (335 GB of RAM, 1000 GB of disk, 96 CPU cores) with 8 TPU v3 cores for 42K steps, with a batch size of 128 and a sequence length of 128. The optimizer used was Adam with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98, ε = 1e-8, and a weight decay of 0.01, with learning-rate warmup for 1,000 steps followed by linear decay.
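The warmup-then-linear-decay schedule can be written out as a simple function of the step count. A minimal sketch, assuming the decay runs to zero over the full 42K-step budget (the exact end point of the decay is not stated above):

```python
def learning_rate(step: int,
                  peak_lr: float = 3e-4,
                  warmup_steps: int = 1_000,
                  total_steps: int = 42_000) -> float:
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0.

    total_steps = 42_000 is an assumption matching the 42K-step run above.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - frac)
```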

We tracked experiments using TensorBoard, which can be found in the Training Metrics tab. BLEU scores for languages other than English might be incorrectly tracked, but the model performs well in those languages too, as is evident from the evaluation scores.

Pretraining Results 📊

Our model reached an eval loss of ~2.6 around ~60K steps. Here are the BLEU scores (out of 1) for the different languages:

| Language | BLEU-1  | BLEU-2  | BLEU-3  | BLEU-4  |
|----------|---------|---------|---------|---------|
| English  | 0.13083 | 0.08887 | 0.06681 | 0.04899 |
| Spanish  | 0.15981 | 0.09858 | 0.06918 | 0.04776 |
| German   | 0.14234 | 0.09817 | 0.07405 | 0.0515  |
| French   | 0.13021 | 0.08862 | 0.06598 | 0.04647 |

Model used: ckpt-51999/

In order to reproduce the results, one can use the evaluation script available in this project's repository.
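For intuition about what the BLEU-1 column measures, here is a minimal sketch of modified unigram precision, the core of BLEU-1 (without the brevity penalty). This is illustrative only; the scores in the table come from the repository's evaluation script, not this function.

```python
from collections import Counter

def bleu1(candidate: list[str], reference: list[str]) -> float:
    """Modified unigram precision: clipped candidate-token matches
    divided by candidate length (BLEU-1 without the brevity penalty)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return clipped / max(1, len(candidate))
```

For example, a candidate repeating "the" twice against a reference containing it once scores 0.5, because the repeated token is clipped to its reference count.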

App Demo

You can try out our model on 🤗 Huggingface's spaces 🪐 : Streamlit app of Multi-lingual Image Captioning model on Huggingface Spaces

Team Members

Credits

Thanks to Huggingface 🤗 and the Google JAX/Flax team for such a wonderful community week. Big thanks to @patrickvonplaten and @patil-suraj for helping us with our solution during the community week.
