CLIP-Vision-mBART50 Seq2Seq Encoder-Decoder Model

CLIP-Vision-mBART50 is pre-trained on a subset of translated Conceptual-12M image-text pairs using a seq2seq training objective. 2.5M cleaned English image-text pairs were translated with Marian models into French, German and Spanish, giving 2.5M examples in each of English, French, German and Spanish. We trained the CLIP-Vision-mBART50 model during the community week hosted by Hugging Face 🤗 using JAX/Flax.

Model description

CLIP-Vision-mBART50 is a modified mBART50 model that takes visual embeddings from the CLIP-Vision transformer and concatenates them with the mBART textual embeddings before passing them to mBART's self-attention layers. This allows deep cross-modal interaction between the two modalities.
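As a rough illustration of the concatenation step, the sketch below joins a visual sequence and a textual sequence along the sequence axis using plain Python lists. The function name `concat_modalities`, the toy sizes, the visual-first ordering, and the assumption that both modalities already share one hidden width are ours for illustration; they are not taken from the model code.

```python
def concat_modalities(visual, textual):
    """Join two embedding sequences along the sequence axis.

    Each argument is a list of vectors (lists of floats). Both must share
    the same hidden width; this toy does no projection.
    """
    widths = {len(vec) for vec in visual + textual}
    if len(widths) > 1:
        raise ValueError(f"mismatched hidden widths: {sorted(widths)}")
    # Visual positions come first, text positions follow (assumed order).
    return visual + textual


# Toy sizes (hypothetical): 3 image patches, 2 text tokens, hidden width 4.
visual = [[0.1] * 4 for _ in range(3)]
textual = [[0.2] * 4 for _ in range(2)]
joint = concat_modalities(visual, textual)
print(len(joint), len(joint[0]))  # sequence length, hidden width
```

The joint sequence is what self-attention would then operate on, so every text position can attend to every image position and vice versa.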

Intended uses & limitations❗️

You can use the raw model as an encoder-decoder network in which the encoder encodes images and the decoder generates text.

Note that this model is primarily intended to be fine-tuned on tasks such as multilingual or monolingual image captioning.

How to use❓

You will need to clone the model from here. An example of usage is shown below:

>>> from torchvision.io import read_image
>>> import numpy as np
>>> import os
>>> from transformers import CLIPProcessor, MBart50TokenizerFast
>>> from model.flax_clip_vision_mbart.modeling_clip_vision_mbart import FlaxCLIPVisionMBartForConditionalGeneration
>>> image_path = os.path.join('images/val2014', os.listdir('images/val2014')[0])
>>> img = read_image(image_path) # reading image
>>> clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
>>> clip_outputs = clip_processor(images=img)
>>> clip_outputs['pixel_values'][0] = clip_outputs['pixel_values'][0].transpose(1,2,0) # transpose to channel-last, which the model expects
>>> tokenizer = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50')
>>> model = FlaxCLIPVisionMBartForConditionalGeneration.from_pretrained('flax-community/clip-vit-base-patch32_mbart-large-50')
>>> output_ids = model.generate(np.asarray(clip_outputs["pixel_values"]), forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"], num_beams=4, max_length=64).sequences  # "en_XX" is the language code of the caption language you want
# en_XX: English, fr_XX: French, es_XX: Spanish, de_DE: German
>>> output_string = tokenizer.batch_decode(output_ids.reshape(-1, 64), skip_special_tokens=True, max_length=64)
>>> output_string # relevant caption
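To switch the caption language, only the language code passed through `forced_bos_token_id` needs to change. The small helper below is a sketch of one way to map a language name to the four codes listed above; `LANG_CODES` and `lang_code` are hypothetical names, not part of the model repository.

```python
# mBART-50 language codes for the four languages this model was trained on.
LANG_CODES = {
    "english": "en_XX",
    "french": "fr_XX",
    "spanish": "es_XX",
    "german": "de_DE",
}


def lang_code(language):
    """Return the mBART-50 code for a language name (case-insensitive)."""
    key = language.strip().lower()
    if key not in LANG_CODES:
        raise ValueError(
            f"unsupported language {language!r}; choose from {sorted(LANG_CODES)}"
        )
    return LANG_CODES[key]


print(lang_code("German"))  # de_DE
# With the tokenizer from the example above, the decoder start token would be:
#   forced_bos_token_id = tokenizer.lang_code_to_id[lang_code("German")]
```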

Training data 🏋🏻‍♂️

The multilingual image captioning model was trained on a subset of the Conceptual 12M dataset by Google:

Conceptual 12M, introduced by Changpinyo et al. in "Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts".

The translated dataset can be downloaded from conceptual-12m-multilingual-marian. We do not provide images as we do not own any of them. One can download images from the image_url section of the original Conceptual 12M dataset.
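For readers who want to reproduce a similar translation step, the following is a minimal sketch using the Marian classes in the `transformers` library. The specific `Helsinki-NLP/opus-mt-en-*` checkpoints are an assumption on our part (this card only says that Marian models were used), and `translate_captions` is a hypothetical helper. Running it requires `transformers`, PyTorch and `sentencepiece`, and downloads model weights on first use.

```python
# Assumed checkpoints: the card says only that Marian models were used.
MARIAN_CHECKPOINTS = {
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "de": "Helsinki-NLP/opus-mt-en-de",
    "es": "Helsinki-NLP/opus-mt-en-es",
}


def translate_captions(captions, target_lang):
    """Translate a list of English captions with a Marian model."""
    # Imported lazily so the mapping above is usable without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    name = MARIAN_CHECKPOINTS[target_lang]
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Example call (not executed here because it downloads weights):
#   translate_captions(["a dog running on the beach"], "fr")
```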

Data Cleaning 🧹

Though the original dataset contains 12M image-text pairs, a lot of the URLs are invalid now, and in some cases, images are corrupt or broken. We remove such examples from our data, which leaves us with approximately 10M image-text pairs.
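The filtering idea can be sketched as follows. This is not our actual cleaning script: it stands in a cheap file-signature check for a full image decode, and it assumes a failed download is represented as `None`. The names `looks_like_image` and `clean_pairs` are hypothetical.

```python
# Leading bytes ("magic numbers") of common web image formats.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",        # JPEG
    b"\x89PNG\r\n\x1a\n",   # PNG
    b"GIF87a",              # GIF (1987 spec)
    b"GIF89a",              # GIF (1989 spec)
)


def looks_like_image(data):
    """Cheap check: does the payload start with a known image signature?"""
    return any(data.startswith(sig) for sig in IMAGE_SIGNATURES)


def clean_pairs(pairs):
    """Keep (image_bytes, caption) pairs that downloaded and look valid.

    A failed download is represented as None (an assumption of this sketch).
    """
    return [
        (data, caption)
        for data, caption in pairs
        if data is not None and looks_like_image(data)
    ]


pairs = [
    (b"\xff\xd8\xff\xe0 fake jpeg body", "a dog on a beach"),
    (None, "image URL no longer resolves"),
    (b"<html>404 Not Found</html>", "server returned an error page"),
]
print(len(clean_pairs(pairs)))  # 1
```

A signature check only catches obviously wrong payloads; truncated or corrupt files with a valid header would still need a real decode attempt.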

Train set

Total data:
10010625 captions
2502656 images

Language-wise distribution:
English: 2502656 captions
Spanish: 2502656 captions
German: 2502656 captions
French: 2502656 captions

Validation set

Total data:
110592 captions
27648 images

Language-wise distribution:
English: 27648 captions
Spanish: 27648 captions
German: 27648 captions
French: 27648 captions
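The totals can be cross-checked against the per-language counts: with one caption per image in each of the four languages, the caption total should be four times the image count. The short check below does that arithmetic. It gives 10010624 for the train set, one fewer than the train total listed above, while the validation total matches exactly.

```python
LANGUAGES = ["English", "Spanish", "German", "French"]

train_images = 2502656
val_images = 27648

# One caption per image per language.
train_captions = train_images * len(LANGUAGES)
val_captions = val_images * len(LANGUAGES)

print(train_captions)  # 10010624
print(val_captions)    # 110592
```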

Training procedure 👨🏻‍💻

Training

The model was trained on a Google Cloud Engine TPUv3-8 machine (335 GB of RAM, 1000 GB of hard drive, 96 CPU cores) with 8 v3 TPU cores for 42K steps, using a batch size of 128 and a sequence length of 128. The optimizer was Adam with a learning rate of 3e-4, β1 = 0.9, β2 = 0.98, ε = 1e-8 and a weight decay of 0.01, with learning rate warmup for 1,000 steps followed by linear decay.
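The learning rate schedule described above can be sketched in plain Python. This assumes the warmup starts from zero and the linear decay reaches zero at step 42,000; the actual start and end values used in training are not stated here, so treat both as assumptions.

```python
PEAK_LR = 3e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 42_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        # Assumed: warmup starts at 0 and rises linearly.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumed: decay ends at 0 exactly when training ends.
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 500, 1_000, 21_500, 42_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at 3e-4 at step 1,000 and is back to half of that at step 21,500, the midpoint of the decay phase.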

We tracked experiments using TensorBoard; the logs can be found in the Training Metrics tab.

Pretraining Results 📊

Our model reached an eval loss of ~2.6 at around 70K steps. Here are the BLEU^ scores for the different languages:

Language	BLEU-1	BLEU-2	BLEU-3	BLEU-4
English	0.163	0.127	0.10	0.081
Spanish	0.171	0.133	0.114	0.082
German	0.165	0.129	0.103	0.077
French	0.162	0.124	0.104	0.073

^BLEU scores are out of 1
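For readers unfamiliar with the BLEU-1 to BLEU-4 columns, the sketch below computes a cumulative BLEU-N on the same 0 to 1 scale for a single candidate and a single reference. It is an illustration only: it uses whitespace tokenization, no smoothing, and sentence-level rather than corpus-level counts, and it is not necessarily the implementation that produced the table.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n):
    """Cumulative BLEU-max_n for one candidate and one reference, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each n-gram's count by its count in the reference.
        matched = sum(
            min(count, ref_counts[gram]) for gram, count in cand_counts.items()
        )
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty punishes candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


print(bleu("a dog runs on the grass", "a dog runs on the grass", 4))  # 1.0
```

Clipping is what stops a degenerate caption such as "the the the" from scoring well against "the cat": only one of its three unigrams is counted as a match.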

App Demo

You can try out our model on 🤗 Hugging Face Spaces 🪐: Streamlit app of the multilingual image captioning model on Hugging Face Spaces

Team Members

Bhavitvya Malik @bhavitvyamalik
Gunjan Chhablani @gchhablani

Credits

Thanks to the Hugging Face 🤗 and Google JAX/Flax teams for such a wonderful community week. Big thanks to @patrickvonplaten and @patil-suraj for helping us with our solution during the community week.