Introduction and Motivation

Visual Question Answering (VQA) is a task in which a model is expected to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets built from natural images found online; VQAv2 and GQA are two examples of such datasets. VQA is a particularly interesting multi-modal machine learning challenge because it has applications across several domains, including healthcare chatbots and interactive agents. However, most VQA challenges and datasets deal with English-only captions and questions.

In addition, even recent approaches proposed for VQA remain relatively inaccessible, because the CNN-based object detectors they rely on for feature extraction are complex and difficult to use. For example, a Faster R-CNN pipeline involves the following steps (see the sketch after this list):

  • an FPN (Feature Pyramid Network) over a ResNet backbone extracts multi-scale features,
  • an RPN (Region Proposal Network) layer then detects proposals in those features,
  • the ROI (Region of Interest) heads then map the box proposals back to the original image,
  • the boxes are filtered using NMS (Non-Maximum Suppression),
  • and the features of the selected boxes are used as visual features.
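
To make the detector-based pipeline concrete, here is a rough, hypothetical sketch of these stages using torchvision's Faster R-CNN implementation; the module names follow torchvision's detection API, and the detectors actually used in VQA systems (e.g. bottom-up attention features) differ in their details.

```python
import torch
import torchvision

# Rough sketch of the detection stages listed above, using torchvision's
# Faster R-CNN (ResNet-50 + FPN). Illustration only; not the exact
# feature-extraction pipeline used by existing VQA models.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

image = torch.rand(3, 480, 640)                        # dummy RGB image tensor
with torch.no_grad():
    images, _ = model.transform([image])               # resize and normalize
    features = model.backbone(images.tensors)          # ResNet + FPN feature maps
    proposals, _ = model.rpn(images, features)         # RPN box proposals (NMS-filtered)
    box_features = model.roi_heads.box_roi_pool(       # ROI-align features for each box
        features, proposals, images.image_sizes)
    box_features = model.roi_heads.box_head(box_features)  # one feature vector per box

# The full pipeline then scores the boxes, filters them with NMS, and keeps
# the features of the surviving boxes as the visual input to the VQA model.
print(box_features.shape)  # (num_proposals, 1024)
```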

A major advantage of using transformers is their simplicity and accessibility - thanks to the Hugging Face team and the authors of ViT and Transformers. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
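
As a minimal sketch of this, the snippet below extracts visual features with a ViT checkpoint through the Transformers library; the checkpoint name is just one publicly available example, not necessarily the one used in this project.

```python
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

# No detector, no region proposals: normalize the image and pass it through
# the transformer to get patch-level visual features.
checkpoint = "google/vit-base-patch16-224-in21k"        # example checkpoint
extractor = ViTFeatureExtractor.from_pretrained(checkpoint)
model = ViTModel.from_pretrained(checkpoint)

image = Image.open("example.jpg")                       # any RGB image
inputs = extractor(images=image, return_tensors="pt")   # resize + normalize
outputs = model(**inputs)
visual_features = outputs.last_hidden_state             # (1, num_patches + 1, hidden_size)
```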

While building a low-resource, non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it helps create a single, generic model that works reasonably well across several languages.

With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA). Our intention here is to provide a proof of concept with our simple CLIP-Vision-BERT baseline, which couples a pre-trained image encoder with a multilingual text checkpoint. Our model currently supports four languages - English, French, German and Spanish.
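
As a rough illustration of what such a baseline can look like, the sketch below couples a CLIP vision encoder with multilingual BERT by projecting the visual patch embeddings and concatenating them with the question's token embeddings. The checkpoint names, the projection layer, and the fusion scheme are assumptions for illustration only, not the exact CLIP-Vision-BERT implementation.

```python
import torch
from PIL import Image
from transformers import (BertModel, BertTokenizerFast,
                          CLIPFeatureExtractor, CLIPVisionModel)

# Hypothetical fusion sketch: names and wiring are illustrative assumptions,
# not the actual CLIP-Vision-BERT code.
text_encoder = BertModel.from_pretrained("bert-base-multilingual-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")

# Project CLIP patch embeddings into BERT's hidden size.
project = torch.nn.Linear(vision_encoder.config.hidden_size,
                          text_encoder.config.hidden_size)

image = Image.open("example.jpg")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
question = tokenizer("Quelle est la couleur du chat ?", return_tensors="pt")

visual_tokens = project(vision_encoder(pixel_values).last_hidden_state)
text_tokens = text_encoder.embeddings(input_ids=question.input_ids)

# Concatenate visual and textual tokens and run them through BERT's layers;
# a classification head over the answer vocabulary would sit on top for VQA.
fused = torch.cat([visual_tokens, text_tokens], dim=1)
encoded = text_encoder.encoder(fused).last_hidden_state
```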

We follow a two-stage training approach, with text-only Masked Language Modeling (MLM) as our pre-training task. Our pre-training data comes from the Conceptual-12M dataset, with captions translated using mBART-50. Our fine-tuning data is taken from the VQAv2 dataset and translated using MarianMT models.
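
As an illustration of the translation step, the sketch below translates an English question into German with one of the MarianMT checkpoints available through Transformers; the checkpoint name and the sentence are examples, not our exact data pipeline.

```python
from transformers import MarianMTModel, MarianTokenizer

# Example only: translate an English question into German with MarianMT.
# Our pipeline uses mBART-50 for the Conceptual-12M captions and MarianMT
# models for the VQAv2 questions.
model_name = "Helsinki-NLP/opus-mt-en-de"   # example en->de checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

question = "What color is the cat sitting on the sofa?"
batch = tokenizer([question], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```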

Our checkpoints achieve a validation accuracy of 0.69 on the MLM task, while our fine-tuned model achieves a validation accuracy of 0.49 on our multilingual VQAv2 validation set. With better captions, hyperparameter tuning, and further training, we expect to see higher performance.

Novel Contributions

Our novel contributions include: