gchhablani committed on
Commit • d8dbe3e
1 Parent(s): b4c0878

Change order in article
Files changed:
- app.py +2 -2
- apps/article.py +1 -1
- sections/intro.md +1 -1
- sections/pretraining.md +1 -1
app.py CHANGED

@@ -7,13 +7,13 @@ from apps.utils import read_markdown
 def main():
     state = _get_state()
     st.set_page_config(
-        page_title="Multilingual
+        page_title="Multilingual VQA",
         layout="wide",
         initial_sidebar_state="collapsed",
         page_icon="./misc/mvqa-logo-3-white.png",
     )

-
+    st.title("Multilingual Visual Question Answering")
     st.write(
         "[Gunjan Chhablani](https://huggingface.co/gchhablani), [Bhavitvya Malik](https://huggingface.co/bhavitvyamalik)"
     )
apps/article.py CHANGED

@@ -4,11 +4,11 @@ from apps.utils import read_markdown
 def app(state):
     st.write(read_markdown("intro.md"))
     st.write("## Methodology")
+    st.markdown(read_markdown("pretraining.md"))
     st.image(
         "./misc/article/Multilingual-VQA.png",
         caption="Masked LM model for Image-text Pretraining.",
     )
-    st.markdown(read_markdown("pretraining.md"))
     st.markdown(read_markdown("finetuning.md"))
     st.write(read_markdown("challenges.md"))
     st.write(read_markdown("limitations.md"))
sections/intro.md CHANGED

@@ -1,7 +1,7 @@
 ## Introduction
 Visual Question Answering (VQA) is a task where we expect the AI to answer a question about a given image. VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are: [VQAv2](https://visualqa.org/challenge.html), [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has several interesting applications across several domains including healthcare chatbots, interactive-agents, etc. **However, most VQA challenges or datasets deal with English-only captions and questions.**

-In addition, even recent **approaches that have been proposed for VQA generally are obscure** due to the
+In addition, even recent **approaches that have been proposed for VQA generally are obscure** due to the fact that CNN-based object detectors are relatively difficult to use and more complex for feature extraction. For example, a FasterRCNN approach uses the following steps:
 - a FPN (Feature Pyramid Net) over a ResNet backbone, and
 - then a RPN (Regision Proposal Network) layer detects proposals in those features, and
 - then the ROI (Region of Interest) heads get the box proposals in the original image, and
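For reference, the multi-stage detector pipeline described in the intro.md hunk above (ResNet backbone, FPN, RPN, RoI heads) looks roughly like the following torchvision sketch. It is an illustration of the kind of region-feature extraction the project avoids, not code from this repository; the dummy tensor and untrained weights are placeholders.

```python
# Rough sketch only, not part of this repository. torchvision's Faster R-CNN
# bundles the ResNet-50 backbone, FPN, RPN and RoI heads described above.
import torch
import torchvision

# Untrained weights are enough to show the structure of the pipeline.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()
model.eval()

image = torch.rand(3, 480, 640)  # dummy RGB image tensor with values in [0, 1]
with torch.no_grad():
    # Internally: backbone+FPN features -> RPN proposals -> RoI heads -> detections
    predictions = model([image])[0]

print(predictions["boxes"].shape, predictions["scores"].shape)
```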
sections/pretraining.md CHANGED

@@ -7,4 +7,4 @@ The dataset we use for pre-training is a cleaned version of [Conceptual 12M](htt

 **Model**

-The model is shown in the image
+The model is shown in the image below. The `Dummy MLM Head` is actually combined with the MLM head but it never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integerates the CLIP Vision model inside BERT embeddings. We also use custom configs and modules in order to accomodate for these changes, and allow loading from BERT and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the text is fed to the word-embedding layers of BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for BERT and CLIP Vision models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-vqa).
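As a rough illustration of the setup described in the pretraining.md hunk above — an image encoded by CLIP-ViT and a multilingual question encoded by BERT, both with a 768-dimensional hidden size — the following minimal sketch loads the two named checkpoints and concatenates their hidden-state sequences. It is not the repository's actual Flax model, which instead injects the CLIP Vision outputs into BERT's embedding layer so one encoder attends over both modalities; the image path and example question are placeholders.

```python
# Minimal sketch, not the repository's fusion module: encode an image with
# CLIP-ViT and a multilingual question with BERT, then concatenate the two
# hidden-state sequences along the sequence axis.
import jax.numpy as jnp
from PIL import Image
from transformers import (AutoTokenizer, CLIPProcessor,
                          FlaxBertModel, FlaxCLIPVisionModel)

# The checkpoints named in pretraining.md; add from_pt=True if a checkpoint
# only ships PyTorch weights.
bert = FlaxBertModel.from_pretrained("bert-base-multilingual-uncased")
clip = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")              # placeholder image path
question = "Quelle est la couleur du chat ?"   # placeholder multilingual question

pixel_values = processor(images=image, return_tensors="np")["pixel_values"]
text_inputs = tokenizer(question, return_tensors="np")

visual_states = clip(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768)
text_states = bert(**text_inputs).last_hidden_state                # (1, T, 768)

# Both encoders use a 768-dim hidden size, so the two sequences can be fused.
fused = jnp.concatenate([visual_states, text_states], axis=1)      # (1, 50 + T, 768)
print(fused.shape)
```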