gchhablani committed on
Commit
82bb660
1 Parent(s): 228e576

Fix write-up

sections/abstract.md CHANGED
@@ -1,2 +1,4 @@
  ## Abstract
- This project is focused on Mutilingual Visual Question Answering. Most of the existing datasets and models on this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept with our simple ViT+BERT model which can be trained on multilingual text checkpoints with pre-trained image encoders and made to perform well enough. Due to lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (already in English), French, German and Spanish using the Marian models. We achieved 0.49 accuracy on the multilingual validation set we created. With better captions, and hyperparameter-tuning, we expect to see higher performance.
 
 
 
  ## Abstract
+ This project focuses on Multilingual Visual Question Answering. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a proof of concept with our simple CLIP Vision + BERT model, which can be trained from multilingual text checkpoints and pre-trained image encoders and still perform reasonably well.
+
+ Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into French, German, and Spanish (keeping the original English captions as well) using the mBART-50 models. We achieve 0.49 accuracy on the multilingual validation set we created. With better captions and hyperparameter tuning, we expect higher performance.
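As a minimal sketch of what such a translation pass looks like, the snippet below uses the public mBART-50 many-to-many checkpoint (`facebook/mbart-large-50-many-to-many-mmt`) with the standard `transformers` API; it is illustrative only, not the project's actual translation script, and the caption is a made-up example.

```python
# Illustrative only: translate one English caption into French with mBART-50.
# The checkpoint name below is the public many-to-many mBART-50 model; the
# project's own translation scripts live in the GitHub repository.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

ckpt = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(ckpt)
tokenizer = MBart50TokenizerFast.from_pretrained(ckpt)

tokenizer.src_lang = "en_XX"                      # source captions are English
inputs = tokenizer("a dog playing in the snow", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],  # target: French
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```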
sections/challenges.md CHANGED
@@ -5,7 +5,7 @@ We faced challenges at every step of the way, despite having some example script
 
  - The translations with deep learning models aren't as "perfect" as translation APIs like Google and Yandex. This could lead to poor performance.
 
- - We prepared the model and config classes for our model from scratch, basing it on `ViT` and `BERT` implementations in Flax. The ViT embeddings should be used inside the BERT embeddings class, which was the major challenge here.
 
  - We prepared a training script for image-text text-only MLM and sequence classification, which we based on hybrid clip, masked LM and the text classification examples.
 
 
 
  - The translations with deep learning models aren't as "perfect" as translation APIs like Google and Yandex. This could lead to poor performance.
 
+ - We prepared the model and config classes for our model from scratch, basing them on the `CLIP Vision` and `BERT` implementations in Flax. Feeding the CLIP Vision (ViT) embeddings into the BERT embeddings class was the major challenge here; a sketch of the idea follows this list.
 
  - We prepared a training script for image-text text-only MLM and sequence classification, which we based on hybrid clip, masked LM and the text classification examples.
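As a rough illustration of what using the image embeddings inside the BERT embeddings means, here is a toy Flax module (names and sizes are ours, not the project's actual classes): the CLIP Vision patch embeddings are projected and prepended to the BERT word embeddings so that a BERT-style encoder can attend over the joint sequence.

```python
# Toy sketch, not the project's actual model/config classes: prepend projected
# visual features to word embeddings to form one [visual ; text] sequence.
import jax
import jax.numpy as jnp
import flax.linen as nn

class ToyVisualTextEmbeddings(nn.Module):
    vocab_size: int = 30000   # placeholder; the real mBERT vocabulary is larger
    hidden_size: int = 768    # BERT-base / CLIP ViT-B/32 hidden size

    @nn.compact
    def __call__(self, input_ids, visual_states):
        word_embeds = nn.Embed(self.vocab_size, self.hidden_size)(input_ids)
        visual_embeds = nn.Dense(self.hidden_size)(visual_states)  # project patches
        return jnp.concatenate([visual_embeds, word_embeds], axis=1)

# Toy shapes: 50 visual tokens (7x7 patches + CLS for ViT-B/32), 16 text tokens.
ids = jnp.ones((2, 16), dtype=jnp.int32)
patches = jnp.ones((2, 50, 768))
module = ToyVisualTextEmbeddings()
params = module.init(jax.random.PRNGKey(0), ids, patches)
fused = module.apply(params, ids, patches)        # shape (2, 66, 768)
```

In the actual model, this fused sequence is what the BERT encoder and MLM head operate on, as described in the pretraining section.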
 
sections/pretraining.md CHANGED
@@ -1,5 +1,5 @@
  ### Pretraining
- We follow an approach similar to [VisualBERT](https://arxiv.org/abs/1908.03557). Instead of using a FasterRCNN to get image features, we use a ViT encoder. The pre-training task is text-only MLM (Masked Language Modeling). We mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task where two captions are matched against an image, but we skip this for the sake of simplicity.
 
  **Dataset**
 
@@ -7,4 +7,4 @@ The dataset we use for pre-training is a cleaned version of [Conceptual 12M](htt
 
  **Model**
 
- The model is shown in the image above. The `Dummy MLM Head` is actually combined with the MLM head but it never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integerates the ViT model inside BERT embeddings. We also use custom configs and modules in order to accomodate for these changes, and allow loading from BERT and ViT checkpoints. The image is fed to the ViT encoder and the text is fed to the word-embedding layers of BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for BERT and ViT (actually CLIPVision) models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-vqa).
 
  ### Pretraining
+ We follow an approach similar to [VisualBERT](https://arxiv.org/abs/1908.03557). Instead of using a Faster R-CNN to get image features, we use a CLIP Vision (ViT transformer) encoder. The pre-training task is text-only MLM (Masked Language Modeling): we mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task, where two captions are matched against an image, but we skip this for the sake of simplicity.
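A simplified sketch of the text-only masking, under the usual BERT assumptions (15% masking rate, ignore index -100); the project's actual data collator may differ, for example by excluding special tokens and using the standard 80/10/10 replacement split.

```python
# Simplified sketch: mask ~15% of the text tokens for MLM; image features are
# never masked. In practice, take the mask id from tokenizer.mask_token_id.
import numpy as np

rng = np.random.default_rng(0)
MASK_ID, MLM_PROB, IGNORE = 103, 0.15, -100       # 103 = [MASK] in BERT vocabs

def mask_text_tokens(input_ids):
    input_ids = np.array(input_ids)
    labels = np.full_like(input_ids, IGNORE)      # IGNORE positions add no loss
    picked = rng.random(input_ids.shape) < MLM_PROB
    labels[picked] = input_ids[picked]            # targets: the original tokens
    input_ids[picked] = MASK_ID                   # corrupt the input
    return input_ids, labels

masked_ids, labels = mask_text_tokens([101, 10127, 10376, 11423, 102])  # toy ids
```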
 
  **Dataset**
 
 
 
  **Model**
 
+ The model is shown in the image above. The `Dummy MLM Head` is actually combined with the MLM head, but it never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integrates the CLIP Vision model inside the BERT embeddings. We also use custom configs and modules in order to accommodate these changes and to allow loading from BERT and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the text is fed to the word-embedding layers of the BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for the BERT and CLIP Vision models, respectively. All our code is available on [GitHub](https://github.com/gchhablani/multilingual-vqa).
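To make the `Dummy MLM Head` remark concrete, here is a small sketch (illustrative, not the project's training code) of how predictions over the visual positions of the joint sequence can be excluded from the loss with an ignore index:

```python
# Illustrative only: MLM loss over the joint [visual ; text] sequence, where
# every visual position and every unmasked text token carries the ignore index,
# so the "dummy" predictions over image tokens never contribute to the loss.
import jax.numpy as jnp
import optax

IGNORE = -100
num_visual, num_text, vocab = 50, 16, 1000             # toy sizes

logits = jnp.zeros((2, num_visual + num_text, vocab))  # toy model outputs
labels = jnp.full((2, num_visual + num_text), IGNORE)  # ignore everything...
labels = labels.at[:, num_visual + 3].set(42)          # ...except masked text tokens

mask = labels != IGNORE
per_token = optax.softmax_cross_entropy_with_integer_labels(
    logits, jnp.where(mask, labels, 0)
)
loss = (per_token * mask).sum() / mask.sum()
```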