gchhablani committed on
Commit 919efff • 1 Parent(s): aa74b67

Update ToC

apps/article.py CHANGED
@@ -5,29 +5,46 @@ from .utils import Toc
 def app(state):
     toc = Toc()
     st.info("Welcome to our Multilingual-VQA demo. Please use the navigation sidebar to move to our demo, or scroll below to read all about our project. 🤗")
+
     st.header("Table of contents")
     toc.placeholder()
+
     toc.header("Introduction and Motivation")
-    st.write(read_markdown("intro.md"))
+    st.write(read_markdown("intro/intro.md"))
     toc.subheader("Novel Contributions")
-    st.write(read_markdown("contributions.md"))
+    st.write(read_markdown("intro/contributions.md"))
+
     toc.header("Methodology")
+
     toc.subheader("Pre-training")
-    st.write(read_markdown("pretraining.md"))
+    st.write(read_markdown("pretraining/intro.md"))
+    # col1, col2 = st.beta_columns([5,5])
     st.image(
         "./misc/article/Multilingual-VQA.png",
         caption="Masked LM model for Image-text Pre-training.",
     )
-    st.write("**Training Logs**")
+    toc.subsubheader("Dataset")
+    st.write(read_markdown("pretraining/data.md"))
+    toc.subsubheader("Model")
+    st.write(read_markdown("pretraining/model.md"))
+    toc.subsubheader("Training Logs")
     st_tensorboard(logdir='./logs/pretrain_logs', port=6006)
+
+
     toc.subheader("Finetuning")
-    st.write(read_markdown("finetuning.md"))
-    st.write("**Training Logs**")
+    toc.subsubheader("Dataset")
+    st.write(read_markdown("finetuning/data.md"))
+    toc.subsubheader("Model")
+    st.write(read_markdown("finetuning/model.md"))
+    toc.subsubheader("Training Logs")
     st_tensorboard(logdir='./logs/finetune_logs', port=6007)
+
     toc.header("Challenges and Technical Difficulties")
     st.write(read_markdown("challenges.md"))
+
     toc.header("Limitations")
     st.write(read_markdown("limitations.md"))
+
     toc.header("Conclusion, Future Work, and Social Impact")
     toc.subheader("Conclusion")
     st.write(read_markdown("conclusion.md"))
@@ -35,11 +52,15 @@ def app(state):
     st.write(read_markdown("future_work.md"))
     toc.subheader("Social Impact")
     st.write(read_markdown("social_impact.md"))
+
     toc.header("References")
     st.write(read_markdown("references.md"))
+
     toc.header("Checkpoints")
     st.write(read_markdown("checkpoints.md"))
     toc.subheader("Other Checkpoints")
     st.write(read_markdown("other_checkpoints.md"))
+
     toc.header("Acknowledgements")
-    st.write(read_markdown("acknowledgements.md"))
+    st.write(read_markdown("acknowledgements.md"))
+    toc.generate()
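Throughout `app()`, the section text is pulled in with `read_markdown`, which is imported from `.utils` but not touched by this commit. Below is a minimal sketch of what such a helper might look like, assuming the files simply live under a top-level `sections/` directory (matching the renames further down); it is an illustration, not the repository's actual implementation.

```python
import os

def read_markdown(path, folder="./sections/"):
    # Hypothetical helper: read a markdown section file (e.g. "intro/intro.md")
    # relative to the sections/ directory and return its contents as a string.
    with open(os.path.join(folder, path), "r", encoding="utf-8") as f:
        return f.read()
```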
apps/utils.py CHANGED
@@ -23,6 +23,9 @@ class Toc:
 
     def subheader(self, text):
         self._markdown(text, "h3", " " * 4)
+
+    def subsubheader(self, text):
+        self._markdown(text, "h4", " " * 8)
 
     def placeholder(self, sidebar=False):
         self._placeholder = st.sidebar.empty() if sidebar else st.empty()
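For context, the `Toc` helper follows a common Streamlit table-of-contents pattern: each `header`/`subheader`/`subsubheader` call renders an anchored heading and records a nested bullet, and `toc.generate()` (now called at the end of `app()`) fills the placeholder with the collected links. The diff only shows the new `subsubheader`, so the sketch below is an assumption about the surrounding class, not the actual code in `apps/utils.py`.

```python
import streamlit as st

class Toc:
    """Rough sketch of a Streamlit table-of-contents helper (assumed structure)."""

    def __init__(self):
        self._items = []          # collected markdown bullets for the ToC
        self._placeholder = None  # st.empty() slot that will hold the ToC

    def header(self, text):
        self._markdown(text, "h2")

    def subheader(self, text):
        self._markdown(text, "h3", " " * 4)

    def subsubheader(self, text):  # the method added in this commit
        self._markdown(text, "h4", " " * 8)

    def placeholder(self, sidebar=False):
        self._placeholder = st.sidebar.empty() if sidebar else st.empty()

    def generate(self):
        # Called once at the end of app(): render the collected links into the placeholder.
        if self._placeholder is not None:
            self._placeholder.markdown("\n".join(self._items), unsafe_allow_html=True)

    def _markdown(self, text, level, space=""):
        # Render an HTML heading with an id so the ToC bullets can link to it.
        anchor = "".join(c.lower() if c.isalnum() else "-" for c in text)
        st.markdown(f"<{level} id='{anchor}'>{text}</{level}>", unsafe_allow_html=True)
        self._items.append(f"{space}* <a href='#{anchor}'>{text}</a>")
```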
sections/{checkpoints.md β†’ checkpoints/checkpoints.md} RENAMED
File without changes
sections/{other_checkpoints.md β†’ checkpoints/other_checkpoints.md} RENAMED
File without changes
sections/{conclusion.md β†’ conclusion_future_work/conclusion.md} RENAMED
File without changes
sections/{future_work.md β†’ conclusion_future_work/future_work.md} RENAMED
File without changes
sections/{social_impact.md β†’ conclusion_future_work/social_impact.md} RENAMED
File without changes
sections/finetuning/data.md ADDED
@@ -0,0 +1,3 @@
+**Dataset**
+
+For fine-tuning, we use the [VQA 2.0](https://visualqa.org/) dataset, specifically the `train` and `validation` sets. We translate all the questions into the four languages specified above using language-specific MarianMT models, which produce better translations and are faster, and hence are better suited for preparing the fine-tuning data. This gives us 4x the number of examples in each subset.
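As a rough illustration of the translation step described above, this is how a single question could be translated with one language-specific MarianMT model in `transformers`. The exact per-language checkpoints used by the project are not listed here, so `Helsinki-NLP/opus-mt-en-fr` is only an assumed example for English to French.

```python
from transformers import MarianMTModel, MarianTokenizer

ckpt = "Helsinki-NLP/opus-mt-en-fr"  # assumed example checkpoint (English -> French)
tokenizer = MarianTokenizer.from_pretrained(ckpt)
model = MarianMTModel.from_pretrained(ckpt)

question = "What color is the umbrella?"
batch = tokenizer([question], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])
```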
sections/{finetuning.md β†’ finetuning/model.md} RENAMED
@@ -1,7 +1,3 @@
-**Dataset**
-
-For fine-tuning, we use the [VQA 2.0](https://visualqa.org/) dataset, specifically the `train` and `validation` sets. We translate all the questions into the four languages specified above using language-specific MarianMT models, which produce better translations and are faster, and hence are better suited for preparing the fine-tuning data. This gives us 4x the number of examples in each subset.
-
 **Model**
 
 We use the `SequenceClassification` model as a reference to create our own sequence classification model, in which a classification layer is attached on top of the pre-trained BERT model in order to perform multi-class classification. 3129 answer labels are chosen, as is the convention for the English VQA task; the mapping can be found [here](https://github.com/gchhablani/multilingual-vqa/blob/main/answer_mapping.json). These are the same labels used when fine-tuning the VisualBERT models. The outputs shown here have been translated using the [`mtranslate`](https://github.com/mouuff/mtranslate) Google Translate API library. We then take various pre-trained checkpoints and train the sequence classification model for varying numbers of steps.
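To make the last point concrete, the sketch below shows how a predicted label index could be mapped back to one of the 3129 answer strings and translated for display with `mtranslate`. The assumption that `answer_mapping.json` maps answer strings to label indices is inferred from the description and not verified against the repository.

```python
import json
from mtranslate import translate

# Assumed format: {"yes": 0, "no": 1, ...} mapping answer strings to label indices.
with open("answer_mapping.json") as f:
    answer_to_idx = json.load(f)
idx_to_answer = {idx: ans for ans, idx in answer_to_idx.items()}

predicted_idx = 42                      # stand-in for argmax over the 3129 logits
answer = idx_to_answer[predicted_idx]
print(answer, "->", translate(answer, "fr"))  # show the answer translated to French
```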
sections/{contributions.md β†’ intro/contributions.md} RENAMED
File without changes
sections/{intro.md β†’ intro/intro.md} RENAMED
File without changes
sections/pretraining/data.md ADDED
@@ -0,0 +1,3 @@
+**Dataset**
+
+The dataset we use for pre-training is a cleaned version of [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m). After downloading the dataset and removing broken images, we are left with about 10M images. We then use the MBart50 `mbart-large-50-one-to-many-mmt` checkpoint to translate the captions into four languages: English, French, German, and Spanish, keeping 2.5 million examples per language.
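The one-to-many translation step can be sketched as follows, using the standard `transformers` usage for the checkpoint named above (its Hub id is `facebook/mbart-large-50-one-to-many-mmt`); the caption string is made up for illustration.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

ckpt = "facebook/mbart-large-50-one-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(ckpt)
tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)
tokenizer.src_lang = "en_XX"  # the Conceptual 12M captions start out in English

caption = "a dog playing with a ball in the park"
inputs = tokenizer(caption, return_tensors="pt")
for lang in ["fr_XX", "de_DE", "es_XX"]:  # French, German, Spanish
    out = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id[lang])
    print(lang, tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```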
sections/pretraining/intro.md ADDED
@@ -0,0 +1 @@
+We follow an approach similar to [VisualBERT](https://arxiv.org/abs/1908.03557). Instead of using a Faster R-CNN to get image features, we use a CLIP Vision (ViT transformer) encoder. The pre-training task is text-only MLM (Masked Language Modeling): we mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task where two captions are matched against an image, but we skip this for the sake of simplicity.
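Below is a minimal sketch of the text-only masking described above, using the multilingual BERT tokenizer named in the model section. For brevity it applies a flat 15% `[MASK]` rate and skips BERT's 80/10/10 replacement split, so it illustrates the idea rather than the project's exact recipe.

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
enc = tokenizer("Un chien joue avec un ballon dans le parc.", return_tensors="np")

input_ids = enc["input_ids"].copy()
labels = np.full_like(input_ids, -100)  # -100 positions are ignored by the MLM loss

special = np.isin(input_ids, tokenizer.all_special_ids)  # never mask [CLS]/[SEP]
rng = np.random.default_rng(0)
mask = (rng.random(input_ids.shape) < 0.15) & ~special

labels[mask] = input_ids[mask]             # predict only the masked text tokens
input_ids[mask] = tokenizer.mask_token_id  # image patches carry no token ids, so they are never masked
print(tokenizer.decode(input_ids[0]))
```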
sections/{pretraining.md β†’ pretraining/model.md} RENAMED
@@ -1,9 +1,3 @@
-We follow an approach similar to [VisualBERT](https://arxiv.org/abs/1908.03557). Instead of using a Faster R-CNN to get image features, we use a CLIP Vision (ViT transformer) encoder. The pre-training task is text-only MLM (Masked Language Modeling): we mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task where two captions are matched against an image, but we skip this for the sake of simplicity.
-
-**Dataset**
-
-The dataset we use for pre-training is a cleaned version of [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m). After downloading the dataset and removing broken images, we are left with about 10M images. We then use the MBart50 `mbart-large-50-one-to-many-mmt` checkpoint to translate the captions into four languages: English, French, German, and Spanish, keeping 2.5 million examples per language.
-
 **Model**
 
-The model is shown in the figure below. The `Dummy MLM Head` is actually combined with the MLM head but never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integrates the CLIP Vision model into the BERT embeddings. We also use custom configs and modules to accommodate these changes and to allow loading from BERT and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the text is fed to the word-embedding layers of the BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for the BERT and CLIP Vision models, respectively. All our code and hyperparameters are available on [GitHub](https://github.com/gchhablani/multilingual-vqa).
+The model is shown in the figure below. The `Dummy MLM Head` is actually combined with the MLM head but never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integrates the CLIP Vision model into the BERT embeddings. We also use custom configs and modules to accommodate these changes and to allow loading from BERT and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the text is fed to the word-embedding layers of the BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for the BERT and CLIP Vision models, respectively. All our code and hyperparameters are available on [GitHub](https://github.com/gchhablani/multilingual-vqa).
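As a rough, non-authoritative illustration of how the two named checkpoints fit together dimensionally (both have 768-d hidden states), the toy snippet below runs each encoder separately and concatenates the outputs. The authors' actual Flax module instead feeds the CLIP patch states in at the BERT embedding layer; its real implementation is in the GitHub repository linked above. The random `pixel_values` stand in for a preprocessed image.

```python
import numpy as np
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxBertModel, FlaxCLIPVisionModel

# Add from_pt=True if a checkpoint does not ship Flax weights.
bert = FlaxBertModel.from_pretrained("bert-base-multilingual-uncased")
clip = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

text_inputs = tokenizer("Was ist auf dem Bild zu sehen?", return_tensors="np")
pixel_values = np.random.randn(1, 3, 224, 224).astype("float32")  # stand-in for a real image

text_states = bert(**text_inputs).last_hidden_state               # (1, seq_len, 768)
image_states = clip(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768): CLS + 49 patches

# Toy fusion only: the model described above splices the visual states into the
# BERT embedding sequence before the transformer layers, not after them.
fused = jnp.concatenate([image_states, text_states], axis=1)
print(fused.shape)
```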