## Abstract
This project focuses on multilingual image captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention is to provide a proof of concept showing that our CLIP Vision + mBART-50 model can be trained starting from a multilingual text checkpoint and a pre-trained image encoder, and made to perform well enough.

Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset from English into French, German and Spanish using the MarianMT model for the respective language; the English captions need no translation and are used as-is. With better translated captions and hyperparameter tuning, we expect to see higher performance.
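
The translation step could be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing script: the `Helsinki-NLP/opus-mt-en-*` checkpoint names are an assumption (the text only says "the MarianMT model belonging to the respective language"), and `MARIAN_CHECKPOINTS`, `load_marian_translator` and `build_multilingual_captions` are hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# Assumed checkpoints: one English-to-X MarianMT model per target language.
MARIAN_CHECKPOINTS: Dict[str, str] = {
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "de": "Helsinki-NLP/opus-mt-en-de",
    "es": "Helsinki-NLP/opus-mt-en-es",
}

Translator = Callable[[List[str]], List[str]]


def load_marian_translator(lang: str) -> Translator:
    """Return a function that translates English captions into `lang`."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    name = MARIAN_CHECKPOINTS[lang]
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    def translate(texts: List[str]) -> List[str]:
        # Tokenize the batch, generate translations, and decode back to text.
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate


def build_multilingual_captions(
    captions: List[str], translators: Dict[str, Translator]
) -> Dict[str, List[str]]:
    """Keep English captions unchanged and add one translated list per language."""
    result: Dict[str, List[str]] = {"en": list(captions)}
    for lang, translate in translators.items():
        result[lang] = translate(list(captions))
    return result
```

With the checkpoints downloaded, something like `build_multilingual_captions(captions, {lang: load_marian_translator(lang) for lang in MARIAN_CHECKPOINTS})` would yield parallel caption lists keyed by language code. Passing the translators in as plain callables keeps the pairing logic independent of the model, so a better translation system could be swapped in later.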