Our novel contributions include:

- A [multilingual variant of the Conceptual-12M dataset (mBART50)](https://huggingface.co/datasets/flax-community/conceptual-12m-mbart-50-multilingual) containing 2.5M image-text pairs in each of four languages (English, French, German, and Spanish), translated using the mBART-50 model.
- A [multilingual variant of the Conceptual-12M dataset (MarianMT)](https://huggingface.co/datasets/flax-community/conceptual-12m-multilingual-marian) containing 2.5M image-text pairs in each of four languages (English, French, German, and Spanish), translated using the MarianMT model.
- [A fusion of the CLIP Vision Transformer and the mBART-50 model](https://github.com/gchhablani/multilingual-vqa/tree/main/models/flax_clip_vision_bert). It takes visual embeddings from the CLIP Vision Transformer and feeds them in as the `encoder_hidden_states` of the mBART-50 decoder, enabling deep cross-modal interaction via cross-attention between the two models (see the sketch after this list).
- A [pre-trained checkpoint](https://huggingface.co/flax-community/clip-vit-base-patch32_mbart-large-50) trained on our multilingual Conceptual-12M variant.
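
The fusion described in the third bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual model code (which lives in the linked repository): it uses the stock `FlaxCLIPVisionModel` and `FlaxMBartForConditionalGeneration` classes from `transformers`, and assumes a hypothetical, untrained linear projection from the CLIP hidden size to the mBART-50 hidden size, whereas in the released checkpoint this mapping is learned during pre-training.

```python
# Rough sketch: CLIP-ViT patch embeddings are projected to the mBART hidden size
# and passed in place of text-encoder outputs, so the mBART-50 decoder's
# cross-attention attends over image patches instead of encoded text.
import jax
import jax.numpy as jnp
from transformers import (
    FlaxCLIPVisionModel,
    FlaxMBartForConditionalGeneration,
    MBart50TokenizerFast,
)

clip = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
mbart = FlaxMBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

# Dummy image batch in (batch, channels, height, width) layout.
pixel_values = jnp.zeros((1, 3, 224, 224))
visual_states = clip(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768)

# Hypothetical (random, untrained) projection to the mBART hidden size;
# the real checkpoint learns this mapping.
projection = jax.random.normal(
    jax.random.PRNGKey(0), (visual_states.shape[-1], mbart.config.d_model)
)
encoder_hidden_states = visual_states @ projection  # (1, 50, 1024)

# Feed the projected visual states where the text-encoder states would normally
# go; the decoder cross-attends to them while producing output tokens.
decoder_input_ids = tokenizer("a cat sitting on a mat", return_tensors="np").input_ids
outputs = mbart.decode(
    decoder_input_ids,
    encoder_outputs=(encoder_hidden_states,),
)
print(outputs.logits.shape)  # (1, decoder_seq_len, vocab_size)
```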