In this project, we presented a proof of concept: a CLIP Vision + mBART-50 baseline that pairs a multilingual checkpoint with pre-trained image encoders and covers four languages: **English, French, German, and Spanish**. Our models achieve a BLEU-1 score of around 0.14, which is decent considering the limited training time we had and how challenging multilingual training is.
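
For readers unfamiliar with the metric, BLEU-1 is the clipped unigram precision of a generated sentence against its references, multiplied by a brevity penalty. The snippet below is a minimal sketch of a sentence-level version, not our evaluation script: the function name `bleu1` is hypothetical, whitespace tokenization is an assumption, and reported scores are usually aggregated over a whole test set rather than per sentence.

```python
import math
from collections import Counter


def bleu1(candidate: str, references: list[str]) -> float:
    """Sentence-level BLEU-1 (illustrative sketch, whitespace tokenization assumed)."""
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_token_lists = [ref.split() for ref in references]

    # For each token, the maximum number of times it appears in any single reference.
    max_ref_counts = Counter()
    for ref_tokens in ref_token_lists:
        for token, count in Counter(ref_tokens).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)

    # Clip candidate counts so repeated tokens are not over-rewarded.
    cand_counts = Counter(cand_tokens)
    clipped = sum(min(count, max_ref_counts[token]) for token, count in cand_counts.items())
    precision = clipped / len(cand_tokens)

    # Brevity penalty uses the reference length closest to the candidate length
    # (ties resolved in favour of the shorter reference).
    c = len(cand_tokens)
    r = min((len(ref) for ref in ref_token_lists), key=lambda n: (abs(n - c), n))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    return brevity_penalty * precision


if __name__ == "__main__":
    score = bleu1("a dog runs on the grass", ["a dog is running on the grass"])
    print(f"BLEU-1: {score:.3f}")
```

In the example call, five of the six candidate tokens appear in the reference and the candidate is one token shorter than it, so the score is (5/6) · exp(−1/6), roughly 0.71. A score near 0.14 therefore indicates that only a small fraction of generated words match the reference captions.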