Our novel contributions include:
- A multilingual variant of the Conceptual-12M dataset (mBART50) containing 2.5M image-text pairs in each of four languages (English, French, German and Spanish), translated using the mBART-50 model.
- A multilingual variant of the Conceptual-12M dataset (MarianMT) containing 2.5M image-text pairs in each of four languages (English, French, German and Spanish), translated using the MarianMT model.
- A fusion of the CLIP Vision Transformer and the mBART50 model. It takes visual embeddings from the CLIP Vision Transformer and feeds them into the `encoder_hidden_states` of an mBART50 decoder. This enables deep cross-modal interaction via cross-attention between the two models.
- A pre-trained checkpoint on our multilingual Conceptual-12M variant.
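
To make the cross-attention idea concrete, here is a minimal, dependency-free sketch of how decoder text states could attend over image patch embeddings passed in as `encoder_hidden_states`. It is an illustration only, not the model's actual implementation: the function name `cross_attention`, the toy vectors, and the 4-dimensional embedding size are all assumptions, and the learned query/key/value projections and multiple heads of a real decoder layer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_hidden_states):
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    Queries come from the text decoder; keys and values are the visual
    embeddings. Learned projections are left out to keep the sketch short.
    """
    dim = len(encoder_hidden_states[0])
    scale = 1.0 / math.sqrt(dim)
    contexts, attention = [], []
    for query in decoder_states:
        # Similarity of this text position to every image patch.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in encoder_hidden_states
        ]
        weights = softmax(scores)
        # Weighted average of the patch embeddings.
        context = [
            sum(w * value[j] for w, value in zip(weights, encoder_hidden_states))
            for j in range(dim)
        ]
        contexts.append(context)
        attention.append(weights)
    return contexts, attention


# Toy stand-ins (assumed values): three image patches, two text positions.
image_patches = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
text_states = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]

contexts, attention = cross_attention(text_states, image_patches)
for position, weights in enumerate(attention):
    print(position, [round(w, 3) for w in weights])
```

Each text position produces one attention distribution over the image patches, so every generated token can draw on a different region of the image.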