gchhablani committed • Commit c5c6018 • 1 Parent(s): 287b7cd

Update intro

Files changed:
- apps/article.py (+1 -1)
- sections/checkpoints.md (+1 -1)
- sections/{abstract.md → intro.md} (+14 -5)
apps/article.py CHANGED

@@ -2,7 +2,7 @@ import streamlit as st
 from apps.utils import read_markdown
 
 def app(state):
-    st.write(read_markdown("
+    st.write(read_markdown("intro.md"))
     st.write("## Methodology")
     col1, col2 = st.beta_columns([1,2])
     col1.image(
sections/checkpoints.md CHANGED

@@ -8,4 +8,4 @@
 - Fine-tuned checkpoints on 45k pre-trained checkpoint with AdaFactor (others use AdamW): [multilingual-vqa-pt-45k-ft-adf](https://huggingface.co/flax-community/multilingual-vqa-pt-45k-ft-adf)
 - Fine-tuned checkpoints on 60k pre-trained checkpoint: [multilingual-vqa-pt-60k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-60k-ft)
 - Fine-tuned checkpoints on 70k pre-trained checkpoint: [multilingual-vqa-pt-70k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-70k-ft)
-- From scratch (without pre-training) model: [multilingual-vqa-ft](https://huggingface.co/flax-community/multilingual-vqa-ft)
+- From scratch (without MLM pre-training) model: [multilingual-vqa-ft](https://huggingface.co/flax-community/multilingual-vqa-ft)
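Note (not part of the commit): the checkpoints listed above are ordinary repositories on the Hugging Face Hub, so they can be fetched with `huggingface_hub`. A minimal sketch is below; the repository id is one of the checkpoints from the list, and actually loading the weights requires the custom Flax model code under models/flax_clip_vision_bert in the project repository, which is not shown here.

```python
# Minimal sketch: download one of the fine-tuned checkpoints listed above.
# Loading it into a model still needs the project's custom FlaxCLIPVisionBert code.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="flax-community/multilingual-vqa-pt-60k-ft")
print(local_path)  # local directory containing the config and Flax weight files
```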
sections/{abstract.md → intro.md} RENAMED

@@ -1,18 +1,27 @@
-
+## Introduction
+Visual Question Answering (VQA) is a task where we expect the AI to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are [VQAv2](https://visualqa.org/challenge.html) and [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has applications across several domains, including healthcare chatbots and interactive agents. **However, most VQA challenges and datasets deal with English-only captions and questions.**
 
-In addition, even recent **approaches that have been proposed for VQA are generally obscure** because CNN-based object detectors are relatively difficult and more complex. For example, a FasterRCNN approach uses the following steps:
+In addition, even recent **approaches that have been proposed for VQA are generally obscure** because CNN-based object detectors are relatively difficult to use and more complex. For example, a FasterRCNN approach uses the following steps:
 - an FPN (Feature Pyramid Net) over a ResNet backbone, and
 - then an RPN (Region Proposal Network) layer detects proposals in those features, and
 - then the ROI (Region of Interest) heads get the box proposals in the original image, and
 - then the boxes are selected using NMS (Non-max suppression),
-- and then the features for the selected boxes.
+- and then the features for the selected boxes are used as visual features.
 
 A major **advantage that comes from using transformers is their simplicity and their accessibility**, thanks to the HuggingFace team and the ViT and Transformers authors. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
 
 While building a low-resource non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it will help create a generic approach/model that works decently well across several languages.
 
-With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA).
+**With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP Vision + BERT baseline, which leverages a multilingual checkpoint and a pre-trained image encoder. Our model currently supports four languages: **English, French, German and Spanish**.
 
 We follow a two-stage training approach, our pre-training task being text-only Masked Language Modeling (MLM). Our pre-training dataset comes from the Conceptual-12M dataset, where we use mBART-50 for translation. Our fine-tuning dataset is taken from the VQAv2 dataset, and its translation is done using MarianMT models.
 
-
+Our checkpoints achieve a **validation accuracy of 0.69 on our MLM task**, while our fine-tuned model achieves a **validation accuracy of 0.49 on our multilingual VQAv2 validation set**. With better captions, hyperparameter tuning, and further training, we expect to see higher performance.
+
+**Novel Contributions**:
+Our contributions include:
+- A [multilingual variant of the Conceptual-12M dataset](https://huggingface.co/datasets/flax-community/conceptual-12m-mbart-50-multilingual) containing 2.5M image-text pairs in each of four languages (English, French, German and Spanish), translated using the mBART-50 model.
+- [Multilingual variants of the VQAv2 train and validation sets](https://huggingface.co/datasets/flax-community/multilingual-vqa) containing four times the original data in English, French, German and Spanish, translated using Marian models.
+- [A fusion of the CLIP Vision Transformer and BERT](https://github.com/gchhablani/multilingual-vqa/tree/main/models/flax_clip_vision_bert) where BERT embeddings are concatenated with visual embeddings at the very beginning and passed through BERT's self-attention layers. This is based on the [VisualBERT](https://arxiv.org/abs/1908.03557) model.
+- A [pre-trained checkpoint](https://huggingface.co/flax-community/clip-vision-bert-cc12m-70k) on our multilingual Conceptual-12M variant, with **0.69** validation accuracy.
+- A [fine-tuned checkpoint](https://huggingface.co/flax-community/clip-vision-bert-vqa-ft-6k) on our multilingual variant of the VQAv2 dataset, with **0.49** validation accuracy.