
Visual Question Answering (VQA) is a task in which an AI system is expected to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are VQAv2 and GQA. VQA is a particularly interesting multi-modal machine learning challenge because it has applications across several domains, including healthcare chatbots and interactive agents. However, most VQA challenges and datasets deal with English-only captions and questions.

In addition, even recently proposed VQA approaches are generally hard to work with, because the CNN-based object detectors they rely on for feature extraction are relatively difficult to use and make the pipeline more complex. Click on the expandable region below to see the steps for a FasterRCNN-based approach.