gchhablani committed • b3c9da2 • Parent(s): 1e0fe04
Add ToC

Files changed:
- apps/article.py +56 -6
- sections/acknowledgements.md +0 -1
- sections/challenges.md +0 -1
- sections/checkpoints.md +0 -8
- sections/conclusion.md +0 -0
- sections/contributions.md +6 -0
- sections/finetuning.md +0 -2
- sections/future_work.md +0 -0
- sections/intro.md +1 -9
- sections/limitations.md +0 -1
- sections/other_checkpoints.md +6 -0
- sections/pretraining.md +0 -1
- sections/references.md +0 -1
- sections/social_impact.md +0 -1
- toc.py +29 -0
apps/article.py
CHANGED
@@ -1,26 +1,76 @@
 import streamlit as st
 from apps.utils import read_markdown
 from streamlit_tensorboard import st_tensorboard
-
+from toc import Toc
 def app(state):
+    toc = Toc()
+    st.title("Table of contents")
     st.info("Welcome to our Multilingual-VQA demo. Please use the navigation sidebar to move to our demo, or scroll below to read all about our project. 🤗")
+    toc.placeholder()
+    toc.header("Introduction and Motivation")
     st.write(read_markdown("intro.md"))
-
+    toc.subheader("Novel Contributions")
+    st.write(read_markdown("contributions.md"))
+    toc.header("Methodology")
+    toc.subheader("Pre-training")
     st.write(read_markdown("pretraining.md"))
     st.image(
         "./misc/article/Multilingual-VQA.png",
-        caption="Masked LM model for Image-text
+        caption="Masked LM model for Image-text Pre-training.",
     )
     st.write("**Training Logs**")
     st_tensorboard(logdir='./logs/pretrain_logs', port=6006)
-
+    toc.subheader("Finetuning")
     st.write(read_markdown("finetuning.md"))
     st.write("**Training Logs**")
     st_tensorboard(logdir='./logs/finetune_logs', port=6007)
-
+    toc.header("Challenges and Technical Difficulties")
     st.write(read_markdown("challenges.md"))
+    toc.header("Limitations")
     st.write(read_markdown("limitations.md"))
+    toc.header("Conclusion, Future Work, and Social Impact")
+    toc.subheader("Conclusion")
+    st.write(read_markdown("conclusion.md"))
+    toc.subheader("Future Work")
+    st.write(read_markdown("future_work.md"))
+    toc.subheader("Social Impact")
     st.write(read_markdown("social_impact.md"))
+    toc.header("References")
     st.write(read_markdown("references.md"))
+    toc.header("Checkpoints")
     st.write(read_markdown("checkpoints.md"))
-
+    toc.subheader("Other Checkpoints")
+    st.write(read_markdown("other_checkpoints.md"))
+    toc.header("Acknowledgements")
+    st.write(read_markdown("acknowledgements.md"))
+
+
+
+
+
+    toc.title("Title")
+
+    for a in range(10):
+        st.write("Blabla...")
+
+    toc.header("Header 1")
+
+    for a in range(10):
+        st.write("Blabla...")
+
+    toc.header("Header 2")
+
+    for a in range(10):
+        st.write("Blabla...")
+
+    toc.subheader("Subheader 1")
+
+    for a in range(10):
+        st.write("Blabla...")
+
+    toc.subheader("Subheader 2")
+
+    for a in range(10):
+        st.write("Blabla...")
+
+    toc.generate()
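For readers unfamiliar with the pattern, here is a minimal, hypothetical sketch (not part of the commit) of how the rewritten apps/article.py uses the `Toc` helper added in toc.py below: an empty placeholder is reserved near the top of the page, each section registers itself as it renders, and `generate()` back-fills the placeholder with anchor links. The section texts below are placeholders, not the real markdown files.

```python
# Minimal sketch of the ToC pattern used in apps/article.py
# (assumes toc.py, shown further down, is importable).
import streamlit as st
from toc import Toc

toc = Toc()
st.title("Table of contents")
toc.placeholder()                          # reserve an empty slot; filled by generate()

toc.header("Introduction and Motivation")  # renders the heading and records a ToC entry
st.write("Introduction text goes here...")

toc.subheader("Novel Contributions")
st.write("Contributions text goes here...")

toc.generate()                             # back-fill the placeholder with anchor links
```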
sections/acknowledgements.md
CHANGED
@@ -1,4 +1,3 @@
-## Acknowledgements
 We thank [Nilakshan Kunananthaseelan](https://huggingface.co/knilakshan20) for helping us whenever he could get a chance. We also thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping in the discussions in the initial phases. [Luke Melas](https://github.com/lukemelas) helped us get the CC-12M data on our TPU-VMs and we are very grateful to him.
 
 This project would not be possible without the help of [Patrick](https://huggingface.co/patrickvonplaten) and [Suraj](https://huggingface.co/valhalla) who met with us and helped review our approach and guided us throughout the project.
sections/challenges.md
CHANGED
@@ -1,4 +1,3 @@
-## Challenges and Technical Difficulties
 We faced challenges at every step of the way, despite having some example scripts and models ready by the 🤗 team in Flax.
 
 - The dataset we used - Conceptual 12M took 2-3 days to translate using MBart (since we didn't have Marian at the time). The major bottleneck was implementing the translation efficiently. We tried using `mtranslate` first but it turned out to be too slow, even with multiprocessing.
sections/checkpoints.md
CHANGED
@@ -1,11 +1,3 @@
-## Checkpoints
 - Pre-trained checkpoint at 60k steps: [clip-vision-bert-cc12m-60k](https://huggingface.co/flax-community/clip-vision-bert-cc12m-60k)
 - Pre-trained checkpoint at 70k steps: [clip-vision-bert-cc12m-70k](https://huggingface.co/flax-community/clip-vision-bert-cc12m-70k)
 - Fine-tuned checkpoint at 6k steps on 60k pre-trained checkpoint: [clip-vision-bert-vqa-ft-6k](https://huggingface.co/flax-community/clip-vision-bert-vqa-ft-6k)
-### Other checkpoints:
-- All pre-trained checkpoints: [multilingual-vqa](https://huggingface.co/flax-community/multilingual-vqa)
-- Fine-tuned checkpoints on 45k pre-trained checkpoint: [multilingual-vqa-pt-45k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-45k-ft)
-- Fine-tuned checkpoints on 45k pre-trained checkpoint with AdaFactor (others use AdamW): [multilingual-vqa-pt-45k-ft-adf](https://huggingface.co/flax-community/multilingual-vqa-pt-45k-ft-adf)
-- Fine-tuned checkpoints on 60k pre-trained checkpoint: [multilingual-vqa-pt-60k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-60k-ft)
-- Fine-tuned checkpoints on 70k pre-trained checkpoint: [multilingual-vqa-pt-60k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-70k-ft)
-- From scratch (without MLM pre-training) model: [multilingual-vqa-ft](https://huggingface.co/flax-community/multilingual-vqa-ft)
sections/conclusion.md
ADDED
File without changes
sections/contributions.md
ADDED
@@ -0,0 +1,6 @@
+Our novel contributions include:
+- A [multilingual variant of the Conceptual-12M dataset](https://huggingface.co/datasets/flax-community/conceptual-12m-mbart-50-multilingual) containing 2.5M image-text pairs each in four languages - English, French, German and Spanish, translated using mBART-50 model.
+- [Multilingual variants of the VQAv2 train and validation sets](https://huggingface.co/datasets/flax-community/multilingual-vqa) containing four times the original data in English, French, German and Spanish, translated using Marian models.
+- [A fusion of CLIP Vision Transformer and BERT model](https://github.com/gchhablani/multilingual-vqa/tree/main/models/flax_clip_vision_bert) where BERT embeddings are concatenated with visual embeddings at the very beginning and passed through BERT self-attention layers. This is based on the [VisualBERT](https://arxiv.org/abs/1908.03557) model.
+- A [pre-trained checkpooint](https://huggingface.co/flax-community/clip-vision-bert-cc12m-70k) on our multilingual with **67.85%** validation accuracy.
+- A [fine-tuned checkpoint](https://huggingface.co/flax-community/clip-vision-bert-vqa-ft-6k) on our multilingual variant of the VQAv2 dataset with **49.76%** validation accuracy.
sections/finetuning.md
CHANGED
@@ -1,5 +1,3 @@
-### Fine-tuning
-
 **Dataset**
 
 For fine-tuning, we use the [VQA 2.0](https://visualqa.org/) dataset - particularly, the `train` and `validation` sets. We translate all the questions into the four languages specified above using language-specific MarianMT models. This is because MarianMT models return better labels and are faster, hence, are better for fine-tuning. We get 4x the number of examples in each subset.
sections/future_work.md
ADDED
File without changes
sections/intro.md
CHANGED
@@ -16,12 +16,4 @@ While building a low-resource non-English VQA approach has several benefits of i
 
 We follow the two-staged training approach, our pre-training task being text-only Masked Language Modeling (MLM). Our pre-training dataset comes from Conceptual-12M dataset where we use mBART-50 for translation. Our fine-tuning dataset is taken from the VQAv2 dataset and its translation is done using MarianMT models.
 
-Our checkpoints achieve a **validation accuracy of 0.69 on our MLM** task, while our fine-tuned model is able to achieve a **validation accuracy of 0.49 on our multilingual VQAv2 validation set**. With better captions, hyperparameter-tuning, and further training, we expect to see higher performance.
-
-### Novel Contributions
-Our novel contributions include:
-- A [multilingual variant of the Conceptual-12M dataset](https://huggingface.co/datasets/flax-community/conceptual-12m-mbart-50-multilingual) containing 2.5M image-text pairs each in four languages - English, French, German and Spanish, translated using mBART-50 model.
-- [Multilingual variants of the VQAv2 train and validation sets](https://huggingface.co/datasets/flax-community/multilingual-vqa) containing four times the original data in English, French, German and Spanish, translated using Marian models.
-- [A fusion of CLIP Vision Transformer and BERT model](https://github.com/gchhablani/multilingual-vqa/tree/main/models/flax_clip_vision_bert) where BERT embeddings are concatenated with visual embeddings at the very beginning and passed through BERT self-attention layers. This is based on the [VisualBERT](https://arxiv.org/abs/1908.03557) model.
-- A [pre-trained checkpooint](https://huggingface.co/flax-community/clip-vision-bert-cc12m-70k) on our multilingual with **67.85%** validation accuracy.
-- A [fine-tuned checkpoint](https://huggingface.co/flax-community/clip-vision-bert-vqa-ft-6k) on our multilingual variant of the VQAv2 dataset with **49.76%** validation accuracy.
+Our checkpoints achieve a **validation accuracy of 0.69 on our MLM** task, while our fine-tuned model is able to achieve a **validation accuracy of 0.49 on our multilingual VQAv2 validation set**. With better captions, hyperparameter-tuning, and further training, we expect to see higher performance.
sections/limitations.md
CHANGED
@@ -1,2 +1 @@
-## Limitations and Bias
 - Our best fine-tuned model only achieves 0.49 accuracy on the multilingual validation data that we create. This could be because of not-so-great quality translations, sub-optimal hyperparameters and lack of ample training. In future, we hope to improve this model by addressing such concerns.
sections/other_checkpoints.md
ADDED
@@ -0,0 +1,6 @@
+- All pre-trained checkpoints: [multilingual-vqa](https://huggingface.co/flax-community/multilingual-vqa)
+- Fine-tuned checkpoints on 45k pre-trained checkpoint: [multilingual-vqa-pt-45k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-45k-ft)
+- Fine-tuned checkpoints on 45k pre-trained checkpoint with AdaFactor (others use AdamW): [multilingual-vqa-pt-45k-ft-adf](https://huggingface.co/flax-community/multilingual-vqa-pt-45k-ft-adf)
+- Fine-tuned checkpoints on 60k pre-trained checkpoint: [multilingual-vqa-pt-60k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-60k-ft)
+- Fine-tuned checkpoints on 70k pre-trained checkpoint: [multilingual-vqa-pt-60k-ft](https://huggingface.co/flax-community/multilingual-vqa-pt-70k-ft)
+- From scratch (without MLM pre-training) model: [multilingual-vqa-ft](https://huggingface.co/flax-community/multilingual-vqa-ft)
sections/pretraining.md
CHANGED
@@ -1,4 +1,3 @@
-### Pretraining
 We follow an approach similar to [VisualBERT](https://arxiv.org/abs/1908.03557). Instead of using a FasterRCNN to get image features, we use a CLIP Vision (ViT transformer) encoder. The pre-training task is text-only MLM (Masked Language Modeling). We mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task where two captions are matched against an image, but we skip this for the sake of simplicity.
 
 **Dataset**
sections/references.md
CHANGED
@@ -1,4 +1,3 @@
-## References
 - [Conceptual 12M Dataset](https://github.com/google-research-datasets/conceptual-12m)
 
 - [VQA v2 Dataset](https://visualqa.org/challenge.html)
sections/social_impact.md
CHANGED
@@ -1,2 +1 @@
-## Social Impact
 Multilingual Visual Question Answering has not received a lot of attention. There are very few multilingual VQA datasets, and that is what we wanted to address here. Our initial plan was to include 4 high-resource and 4 low-resource languages in our training data. However, the existing translations do not perform as well and we would have received poor labels, not to mention, with a longer training time. We hope to improve this in the future by using better translators (for e.g. Google Translate API) to get more multilingual data, especially in low-resource languages. Regardless, our aim with this project was to provide with a pipeline approach to deal with Multilingual visuo-linguistic pretraining and perform Multilingual Visual Question Answering.
toc.py
ADDED
@@ -0,0 +1,29 @@
+import streamlit as st
+
+class Toc:
+
+    def __init__(self):
+        self._items = []
+        self._placeholder = None
+
+    def title(self, text):
+        self._markdown(text, "h1")
+
+    def header(self, text):
+        self._markdown(text, "h2", " " * 2)
+
+    def subheader(self, text):
+        self._markdown(text, "h3", " " * 4)
+
+    def placeholder(self, sidebar=False):
+        self._placeholder = st.sidebar.empty() if sidebar else st.empty()
+
+    def generate(self):
+        if self._placeholder:
+            self._placeholder.markdown("\n".join(self._items), unsafe_allow_html=True)
+
+    def _markdown(self, text, level, space=""):
+        key = "".join(filter(str.isalnum, text)).lower()
+
+        st.markdown(f"<{level} id='{key}'>{text}</{level}>", unsafe_allow_html=True)
+        self._items.append(f"{space}* <a href='#{key}'>{text}</a>")
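As an aside, a small illustration (not part of the commit) of how `Toc._markdown` ties headings to ToC entries: the anchor id keeps only the alphanumeric characters of the heading text, lowercased, so the emitted heading tag and the ToC link share the same HTML `id`.

```python
# Hypothetical illustration of the anchor-key scheme used by Toc._markdown;
# runs as plain Python, no Streamlit needed.
text = "Challenges and Technical Difficulties"
key = "".join(filter(str.isalnum, text)).lower()

heading_html = f"<h2 id='{key}'>{text}</h2>"
toc_entry = f"  * <a href='#{key}'>{text}</a>"

print(heading_html)  # <h2 id='challengesandtechnicaldifficulties'>Challenges and Technical Difficulties</h2>
print(toc_entry)     # clicking this link scrolls the page to the heading above
```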