Visual Question Answering (VQA) is a task where we expect the AI to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are: [VQAv2](https://visualqa.org/challenge.html), [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has several interesting applications across several domains including healthcare chatbots, interactive-agents, etc. **However, most VQA challenges or datasets deal with English-only captions and questions.** In addition, even recent **approaches that have been proposed for VQA generally are obscure** due to the fact that CNN-based object detectors are relatively difficult to use and more complex for feature extraction. For example, a FasterRCNN approach uses the following steps: - the image features are given out by a FPN (Feature Pyramid Net) over a ResNet backbone, and - then a RPN (Regision Proposal Network) layer detects proposals in those features, and - then the ROI (Region of Interest) heads get the box proposals in the original image, and - the the boxes are selected using a NMS (Non-max suppression), - and then the features for selected boxes are used as visual features. A major **advantage that comes from using transformers is their simplicity and their accessibility** - thanks to HuggingFace, ViT and Transformers. For ViT models, for example, all one needs to do is pass the normalized images to the transformer. While building a low-resource non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it will help create a generic model that works well across several languages. And then, it can be fine-tuned in low-resource settings to leverage pre-training improvements. **With the aim of democratizing such a challenging yet interesting task, in this project, we focus on Mutilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP-Vision-BERT baseline which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports for four languages - **English, French, German and Spanish**. We follow the two-staged training approach, our pre-training task being text-only Masked Language Modeling (MLM). Our pre-training dataset comes from Conceptual-12M dataset where we use mBART-50 for translation. Our fine-tuning dataset is taken from the VQAv2 dataset and its translation is done using MarianMT models. Our checkpoints achieve a **validation accuracy of 0.69 on our MLM** task, while our fine-tuned model is able to achieve a **validation accuracy of 0.49 on our multilingual VQAv2 validation set**. With better captions, hyperparameter-tuning, and further training, we expect to see higher performance.