Introduction and Motivation

Visual Question Answering (VQA) is a task in which a model is expected to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets built from natural images found online; VQAv2 and GQA are two examples of such datasets. VQA is a particularly interesting multi-modal machine learning challenge because it has applications across several domains, including healthcare chatbots and interactive agents. However, most VQA challenges and datasets deal with English-only captions and questions.

In addition, even recent approaches proposed for VQA remain relatively inaccessible, because the CNN-based object detectors they rely on for feature extraction are complex and difficult to use. For example, a Faster R-CNN pipeline involves the following steps (see the sketch after this list):

  • an FPN (Feature Pyramid Network) over a ResNet backbone extracts multi-scale features,
  • an RPN (Region Proposal Network) layer then detects proposals in those features,
  • the ROI (Region of Interest) heads then map the box proposals back to the original image,
  • the boxes are filtered using NMS (Non-Maximum Suppression),
  • and the features of the selected boxes are used as visual features.
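
To make the detector-based pipeline concrete, here is a rough, hypothetical sketch of these stages using torchvision's Faster R-CNN implementation; the module names follow torchvision's detection API, and the detectors actually used in VQA systems (e.g. bottom-up attention features) differ in their details.

```python
import torch
import torchvision

# Rough sketch of the detection stages listed above, using torchvision's
# Faster R-CNN (ResNet-50 + FPN). Illustration only; not the exact
# feature-extraction pipeline used by existing VQA models.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

image = torch.rand(3, 480, 640)                        # dummy RGB image tensor
with torch.no_grad():
    images, _ = model.transform([image])               # resize and normalize
    features = model.backbone(images.tensors)          # ResNet + FPN feature maps
    proposals, _ = model.rpn(images, features)         # RPN box proposals (NMS-filtered)
    box_features = model.roi_heads.box_roi_pool(       # ROI-align features for each box
        features, proposals, images.image_sizes)
    box_features = model.roi_heads.box_head(box_features)  # one feature vector per box

# The full pipeline then scores the boxes, filters them with NMS, and keeps
# the features of the surviving boxes as the visual input to the VQA model.
print(box_features.shape)  # (num_proposals, 1024)
```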

A major advantage of using transformers is their simplicity and accessibility - thanks to the Hugging Face team and the authors of ViT and Transformers. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
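
As a minimal sketch of this, the snippet below extracts visual features with a ViT checkpoint through the Transformers library; the checkpoint name is just one publicly available example, not necessarily the one used in this project.

```python
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

# No detector, no region proposals: normalize the image and pass it through
# the transformer to get patch-level visual features.
checkpoint = "google/vit-base-patch16-224-in21k"        # example checkpoint
extractor = ViTFeatureExtractor.from_pretrained(checkpoint)
model = ViTModel.from_pretrained(checkpoint)

image = Image.open("example.jpg")                       # any RGB image
inputs = extractor(images=image, return_tensors="pt")   # resize + normalize
outputs = model(**inputs)
visual_features = outputs.last_hidden_state             # (1, num_patches + 1, hidden_size)
```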

While building a low-resource, non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it helps create a single, generic model that works reasonably well across several languages.

With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA). Our intention here is to provide a proof of concept with our simple CLIP-Vision-BERT baseline, which couples a pre-trained image encoder with a multilingual text checkpoint. Our model currently supports four languages - English, French, German and Spanish.
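
As a rough illustration of what such a baseline can look like, the sketch below couples a CLIP vision encoder with multilingual BERT by projecting the visual patch embeddings and concatenating them with the question's token embeddings. The checkpoint names, the projection layer, and the fusion scheme are assumptions for illustration only, not the exact CLIP-Vision-BERT implementation.

```python
import torch
from PIL import Image
from transformers import (BertModel, BertTokenizerFast,
                          CLIPFeatureExtractor, CLIPVisionModel)

# Hypothetical fusion sketch: names and wiring are illustrative assumptions,
# not the actual CLIP-Vision-BERT code.
text_encoder = BertModel.from_pretrained("bert-base-multilingual-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")

# Project CLIP patch embeddings into BERT's hidden size.
project = torch.nn.Linear(vision_encoder.config.hidden_size,
                          text_encoder.config.hidden_size)

image = Image.open("example.jpg")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
question = tokenizer("Quelle est la couleur du chat ?", return_tensors="pt")

visual_tokens = project(vision_encoder(pixel_values).last_hidden_state)
text_tokens = text_encoder.embeddings(input_ids=question.input_ids)

# Concatenate visual and textual tokens and run them through BERT's layers;
# a classification head over the answer vocabulary would sit on top for VQA.
fused = torch.cat([visual_tokens, text_tokens], dim=1)
encoded = text_encoder.encoder(fused).last_hidden_state
```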

We follow a two-stage training approach, with text-only Masked Language Modeling (MLM) as our pre-training task. Our pre-training data comes from the Conceptual-12M dataset, with captions translated using mBART-50. Our fine-tuning data is taken from the VQAv2 dataset and translated using MarianMT models.
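
As an illustration of the translation step, the sketch below translates an English question into German with one of the MarianMT checkpoints available through Transformers; the checkpoint name and the sentence are examples, not our exact data pipeline.

```python
from transformers import MarianMTModel, MarianTokenizer

# Example only: translate an English question into German with MarianMT.
# Our pipeline uses mBART-50 for the Conceptual-12M captions and MarianMT
# models for the VQAv2 questions.
model_name = "Helsinki-NLP/opus-mt-en-de"   # example en->de checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

question = "What color is the cat sitting on the sofa?"
batch = tokenizer([question], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```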

Our checkpoints achieve a validation accuracy of 0.69 on the MLM task, while our fine-tuned model achieves a validation accuracy of 0.49 on our multilingual VQAv2 validation set. With better captions, hyperparameter tuning, and further training, we expect to see higher performance.

Novel Contributions

Our novel contributions include: