## Abstract

This project focuses on Multilingual Visual Question Answering (VQA). Most existing datasets and models for this task work with English-only image-text pairs. Our intention is to provide a proof of concept: a simple CLIP Vision + BERT model that combines a pre-trained image encoder with a multilingual text checkpoint and can be trained to perform reasonably well. Due to the lack of good-quality multilingual data, we translate the captions of subsets of the Conceptual 12M dataset (originally in English) into French, German, and Spanish using mBART-50 models, keeping the English captions as-is. We reach an eval accuracy of 0.69 on the MLM pre-training task and 0.49 accuracy on a multilingual validation set of VQAv2 that we created using Marian models. With better captions and hyperparameter tuning, we expect higher performance.
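
The caption-translation step described above can be sketched with the Hugging Face `transformers` mBART-50 API. The checkpoint name, batching, and generation settings below are illustrative assumptions, not necessarily what the project used:

```python
# Minimal sketch of translating English captions with mBART-50,
# assuming the facebook/mbart-large-50-one-to-many-mmt checkpoint.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-one-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

def translate(captions, target_lang):
    """Translate a batch of English captions into target_lang (e.g. 'fr_XX')."""
    inputs = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[target_lang],
        max_length=64,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Example: produce a French version of a small caption subset.
french_captions = translate(["a dog playing in the snow"], "fr_XX")
```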