# Abstract

This project focuses on Multilingual Image Captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention is to provide a Proof-of-Concept showing that our CLIP Vision + mBART-50 model, which pairs a pre-trained image encoder with multilingual textual checkpoints, can be trained to perform reasonably well on this task. Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into French, German, and Spanish (the English captions need no translation) using the mBART-50 large `one-to-many` model. With better-translated captions and hyperparameter tuning, we expect to see higher performance.
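
As an illustration of the translation step, the sketch below translates an English caption into French with the mBART-50 `one-to-many` model via Hugging Face Transformers. This is a minimal example, not the project's exact pipeline; the checkpoint name, language codes, and the sample caption are assumptions.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed checkpoint for the mBART-50 one-to-many translation model.
model_name = "facebook/mbart-large-50-one-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")

# Hypothetical English caption from a Conceptual 12M subset.
caption = "a dog playing in the snow"
inputs = tokenizer(caption, return_tensors="pt")

# Force the decoder to start with the target-language code (French here);
# "de_DE" or "es_XX" would be used for German or Spanish.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```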