gchhablani committed on
Commit 1e0d575
1 Parent(s): 627e34d

Update sections

sections/acknowledgements.md CHANGED
@@ -1,2 +1,4 @@
  # Acknowledgements
- We thank [Nilakshan Kunananthaseelan](https://huggingface.co/knilakshan20) for helping us whenever he could get a chance. We also thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping in the discussions in the initial phases. Lastly, [Luke Melas](https://github.com/lukemelas) helped us get the CC-12M data on our TPU-VMs and we are very grateful to him.
+ We thank [Nilakshan Kunananthaseelan](https://huggingface.co/knilakshan20) for helping us whenever he could get a chance. We also thank [Abheesht Sharma](https://huggingface.co/abheesht) for helping in the discussions in the initial phases. [Luke Melas](https://github.com/lukemelas) helped us get the CC-12M data on our TPU-VMs and we are very grateful to him.
+
+ This project would not be possible without the help of [Patrick](https://huggingface.co/patrickvonplaten) and [Suraj](https://huggingface.co/valhalla) who met with us and helped review our approach and code again and again.
sections/challenges.md CHANGED
@@ -9,6 +9,6 @@ We faced challenges at every step of the way, despite having some example script

  - We prepared a training script for image-text text-only MLM and sequence classification, which we based on hybrid clip, masked LM and the text classification examples.

- - We were only able to get around 1.5 days of training time on TPUs due to above mentioned challenges. We were unable to perform hyperparameter tuning. Our loss curves on the pre-training model show that the training hasn't converged, and we could see further improvement in the MLM accuracy.
+ - We were only able to get around 1.5 days of training time on TPUs due to above mentioned challenges. We were unable to perform hyperparameter tuning. Our [loss curves on the pre-training model](https://huggingface.co/flax-community/multilingual-vqa/tensorboard) show that the training hasn't converged, and we could see further improvement in the MLM accuracy.

  - The VQA dataset, despite having many examples, and after translating into 4x the number of examples, is small and the model overfits. In order to address this, we need more multilingual data, and lighter models, which are both a major challenge right now.
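
The challenges above mention translating the VQA data into 4x the number of examples. A minimal sketch of that kind of question-translation augmentation, assuming an `mtranslate`-based translator and an en/fr/de/es language set; the project's actual translation tooling and language choices may differ:

```python
from mtranslate import translate

# Hypothetical target languages for the "4x" augmentation; the real
# language set used by the project is an assumption here.
TARGET_LANGS = ["fr", "de", "es"]

def augment_question(question_en: str) -> dict:
    """Return the original English question plus machine translations."""
    translations = {"en": question_en}
    for lang in TARGET_LANGS:
        # mtranslate calls the Google Translate web endpoint, so this
        # needs an internet connection and is slow for large datasets.
        translations[lang] = translate(question_en, lang, "en")
    return translations

print(augment_question("What color is the cat?"))
```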
sections/usage.md CHANGED
@@ -1,12 +1,12 @@
- - This demo loads the `FlaxCLIPVisionBertForSequenceClassificationModel` present in the `model` directory of this repository. The checkpoint is loaded from `ckpt/ckpt-60k-5999` which is pre-trained checkpoint with 60k steps and 5999 fine-tuning steps. 100 random examples are present in the `dummy_vqa_multilingual.tsv` which respective images in the `images/val2014` directory.
+ - This demo loads the `FlaxCLIPVisionBertForSequenceClassificationModel` present in the `model` directory of this repository. The checkpoint is loaded from `ckpt/ckpt-60k-5999`, which is a pre-trained checkpoint with 60k steps and 5999 fine-tuning steps. 100 random examples are present in the `dummy_vqa_multilingual.tsv` with respective images in the `images/val2014` directory.

- - You can also upload your image using the `Upload your image` file uplaoder and type in a question of your choosing.
+ - You can also upload your image using the `Upload your image` file uploader and type in a question of your choosing in the textual input area.

- - We provide `English Translation` of the question for users who are not acquainted with the other languages. This is done using `mtranslate` to keep things flexible enough and needs internet connection as it uses the Google Translate API.
+ - We provide an `English Translation` of the question for users who are not well-acquainted with the other languages. This is done using `mtranslate` to keep things flexible and needs an internet connection as it uses the Google Translate API.

  - The model predicts the answers from a list of 3129 answers which have their labels present in `answer_reverse_mapping.json`.

- - Lastly, once can choose the `Answer Language` which is also a saved dictionary created using `mtranslate` library for the 3129 answer options.
+ - Lastly, one can choose the `Answer Language`, which also uses a saved dictionary created using the `mtranslate` library for the 3129 answer options.

  - The top-5 predictions are displayed below and their respective confidence scores are shown in form of a bar plot.

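The usage notes above describe picking the top-5 predictions out of 3129 answers via `answer_reverse_mapping.json` and offering them in a chosen `Answer Language` through a saved `mtranslate` dictionary. A minimal sketch of those two steps, assuming `answer_reverse_mapping.json` maps stringified label indices to answer strings; the demo's actual code and data layout may differ:

```python
import json
import numpy as np
from mtranslate import translate

# Assumed structure: {"0": "yes", "1": "no", ...} for all 3129 labels.
with open("answer_reverse_mapping.json") as f:
    reverse_mapping = json.load(f)

def top5_answers(logits: np.ndarray):
    """Map 3129-way classifier logits to the five most confident answers."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax -> confidence scores
    top_idx = probs.argsort()[::-1][:5]  # indices of the five best labels
    return [(reverse_mapping[str(i)], float(probs[i])) for i in top_idx]

def build_answer_translations(language: str) -> dict:
    """Precompute a per-language answer dictionary like the one the notes describe."""
    # One mtranslate (Google Translate) call per answer, so this needs an
    # internet connection and is meant to be run once and saved to disk.
    return {ans: translate(ans, language, "en") for ans in reverse_mapping.values()}

# Stand-in logits in place of a real model output:
print(top5_answers(np.random.randn(3129)))
```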