gchhablani committed
Commit 287b7cd
1 Parent(s): 70de33c

Update article

apps/article.py CHANGED
@@ -3,9 +3,8 @@ from apps.utils import read_markdown
 
 def app(state):
     st.write(read_markdown("abstract.md"))
-    st.write(read_markdown("caveats.md"))
     st.write("## Methodology")
-    col1, col2 = st.beta_columns([1, 1])
+    col1, col2 = st.beta_columns([1,2])
     col1.image(
         "./misc/article/Multilingual-VQA.png",
         caption="Masked LM model for Image-text Pretraining.",
@@ -13,6 +12,7 @@ def app(state):
     col2.markdown(read_markdown("pretraining.md"))
     st.markdown(read_markdown("finetuning.md"))
     st.write(read_markdown("challenges.md"))
+    st.write(read_markdown("limitations.md"))
     st.write(read_markdown("social_impact.md"))
     st.write(read_markdown("references.md"))
     st.write(read_markdown("checkpoints.md"))
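For context, `read_markdown` (imported from `apps.utils` in the hunk header above) is not shown in this commit. Below is a minimal sketch of what such a helper presumably looks like, assuming it simply returns the raw text of a file from the `sections/` directory touched elsewhere in this diff; the actual implementation may differ.

```python
# Hypothetical sketch, not part of this commit: read_markdown is assumed to
# load a section file (e.g. "abstract.md") and return its contents so that
# st.write / st.markdown can render it.
from pathlib import Path

SECTIONS_DIR = Path("sections")  # assumed location, matching the paths in this diff


def read_markdown(filename: str) -> str:
    """Return the raw markdown text of a section file."""
    return (SECTIONS_DIR / filename).read_text(encoding="utf-8")
```

Note that `st.beta_columns` is the older pre-1.0 Streamlit API; current releases expose the same layout primitive as `st.columns`.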
sections/abstract.md CHANGED
@@ -1,22 +1,17 @@
- Visual Question Answering (VQA) is a task where we expect the AI to answer a question about a given image.
-
- VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets: [VQAv2](https://visualqa.org/challenge.html), [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has several interesting applications across several domains including healthcare chatbots, interactivw-agents, etc.
-
- However, most VQA challenges or datasets deal with English-only captions and questions.
- In addition, even recent approaches that have been proposed for VQA generally are obscure due to the reasons that CNN-based object detectors are relatively difficult and more complex.
-
- For example, a FasterRCNN approach uses the following steps:
- - a FPN (Feature Pyramid Net) over a ResNet backbone, and
- - then a RPN (Regision Proposal Network) layer detects proposals in those features, and
- - then the ROI (Region of Interest) heads get the box proposals in the original image, and
- - the the boxes are selected using a NMS (Non-max suppression),
- - and then the features for selected boxes.
-
- A major advantage that comes from using transformers is their simplicity and their accessibility - thanks to HuggingFace team, ViT and Transformers authors. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
+ Visual Question Answering (VQA) is a task where we expect the AI to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are [VQAv2](https://visualqa.org/challenge.html) and [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has applications across several domains, including healthcare chatbots, interactive agents, etc. **However, most VQA challenges or datasets deal with English-only captions and questions.**
+
+ In addition, even recent **approaches proposed for VQA are generally obscure**, because the CNN-based object detectors they rely on are relatively difficult and complex. For example, a Faster R-CNN approach uses the following steps:
+ - an FPN (Feature Pyramid Network) over a ResNet backbone extracts features, and
+ - then an RPN (Region Proposal Network) layer detects proposals in those features, and
+ - then the ROI (Region of Interest) heads get the box proposals in the original image, and
+ - then the boxes are selected using NMS (Non-Maximum Suppression), and
+ - finally, the features for the selected boxes are extracted.
+
+ A major **advantage of using transformers is their simplicity and accessibility** - thanks to the HuggingFace team and the ViT and Transformers authors. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
 
 While building a low-resource non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it will help create a generic approach/model that works decently well across several languages.
 
- With the aim of democratizing such an obscure yet interesting task, in this project, we focus on Mutilingual Visual Question Answering (MVQA). Our intention here is to provide a Proof-of-Concept with our simple CLIP Vision + BERT baseline which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports for four languages - English, French, German and Spanish.
+ With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA). Our intention here is to provide a Proof-of-Concept with our simple CLIP Vision + BERT baseline, which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports four languages - English, French, German and Spanish.
 
 We follow a two-stage training approach, our pre-training task being text-only Masked Language Modeling (MLM). Our pre-training dataset comes from the Conceptual-12M dataset, where we use mBART-50 for translation. Our fine-tuning dataset is taken from the VQAv2 dataset and its translation is done using MarianMT models.
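To make the contrast between the detector pipeline and the transformer route concrete, here is a purely illustrative sketch (not code from this repository, which pairs a CLIP vision encoder with a multilingual BERT) of how a stock ViT checkpoint from HuggingFace `transformers` consumes an image: the feature extractor resizes and normalizes the pixels, and the encoder takes them directly, with no FPN/RPN/ROI/NMS stages in between.

```python
# Illustrative only - a generic ViT forward pass, not this project's model.
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.new("RGB", (224, 224))  # placeholder; use any RGB image here
inputs = feature_extractor(images=image, return_tensors="pt")  # resize + normalize
outputs = model(**inputs)
patch_embeddings = outputs.last_hidden_state  # (1, num_patches + 1, hidden_size)
```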
sections/caveats.md DELETED
@@ -1 +0,0 @@
- **Caveats**: The best fine-tuned model only achieves 0.49 accuracy on the multilingual validation data that we create. This could be because of not-so-great quality translations, sub-optimal hyperparameters and lack of ample training. In future, we hope to improve this model by addressing such concerns.
sections/limitations.md ADDED
@@ -0,0 +1,2 @@
+ ## Limitations and Bias
+ - Our best fine-tuned model only achieves 0.49 accuracy on the multilingual validation data that we create. This could be because of not-so-great translation quality, sub-optimal hyperparameters, and a lack of ample training. In the future, we hope to improve this model by addressing these concerns.