gchhablani committed
Commit fc3079d • Parent(s): 9aabff8

Add pretraining logs
Files changed:
- apps/article.py (+6 -2)
- logs/pretrain_logs/pretrain_events_mvqa (+0 -0)
- requirements.txt (+2 -1)
- sections/challenges.md (+1 -1)
- sections/logs.md (+0 -0)
- sections/pretraining.md (+1 -1)
apps/article.py
CHANGED
@@ -1,15 +1,19 @@
 import streamlit as st
 from apps.utils import read_markdown
+from streamlit_tensorboard import st_tensorboard
 
 def app(state):
     st.write(read_markdown("intro.md"))
     st.write("## Methodology")
-    st.
+    st.write(read_markdown("pretraining.md"))
     st.image(
         "./misc/article/Multilingual-VQA.png",
         caption="Masked LM model for Image-text Pretraining.",
     )
-    st.
+    st.write("**Training Logs**")
+    st_tensorboard(logdir='./logs/pretrain_logs', port=6006)
+
+    st.write(read_markdown("finetuning.md"))
     st.write(read_markdown("challenges.md"))
     st.write(read_markdown("limitations.md"))
     st.write(read_markdown("social_impact.md"))
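For reference, the TensorBoard embedding added above also works as a standalone page. A minimal sketch, assuming streamlit-tensorboard is installed (as pinned in requirements.txt below) and that ./logs/pretrain_logs holds the event file added in this commit:

import streamlit as st
from streamlit_tensorboard import st_tensorboard

# Renders an embedded TensorBoard instance pointed at the pretraining event files.
st.write("**Training Logs**")
st_tensorboard(logdir="./logs/pretrain_logs", port=6006)

Serving this with `streamlit run <your_script>.py` starts TensorBoard on port 6006 and embeds it in the page, which is exactly how the article page above surfaces the pretraining logs.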
logs/pretrain_logs/pretrain_events_mvqa
ADDED
Binary file (9.47 MB)
requirements.txt
CHANGED
@@ -5,4 +5,5 @@ torchvision==0.10.0
 mtranslate==1.8
 black==21.7b0
 flax==0.3.4
-torch==1.9.0
+torch==1.9.0
+streamlit-tensorboard==0.0.2
sections/challenges.md
CHANGED
@@ -9,6 +9,6 @@ We faced challenges at every step of the way, despite having some example script
 
 - We prepared a training script for image-text text-only MLM and sequence classification, which we based on the hybrid CLIP, masked LM, and text classification examples.
 
-- We were only able to get around 1.5 days of training time on TPUs due to the above-mentioned challenges. We were unable to perform hyperparameter tuning. Our [loss curves on the pre-training
+- We were only able to get around 1.5 days of training time on TPUs due to the above-mentioned challenges. We were unable to perform hyperparameter tuning. Our [loss curves on the pre-training](https://huggingface.co/flax-community/multilingual-vqa/tensorboard) show that the training hasn't converged, and we could see further improvement in the MLM accuracy.
 
 - The VQA dataset, despite having many examples, and after translating into 4x the number of examples, is small and the model overfits. In order to address this, we need more multilingual data and lighter models, which are both a major challenge right now.
sections/logs.md
ADDED
Empty file
sections/pretraining.md
CHANGED
@@ -7,4 +7,4 @@ The dataset we use for pre-training is a cleaned version of [Conceptual 12M](htt
 
 **Model**
 
-The model is shown in the image below. The `Dummy MLM Head` is actually combined with the MLM head but it never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integrates the CLIP Vision model inside the BERT embeddings. We also use custom configs and modules in order to accommodate these changes and to allow loading from BERT and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the text is fed to the word-embedding layers of the BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for the BERT and CLIP Vision models, respectively. All our code
+The model is shown in the image below. The `Dummy MLM Head` is actually combined with the MLM head but it never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integrates the CLIP Vision model inside the BERT embeddings. We also use custom configs and modules in order to accommodate these changes and to allow loading from BERT and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the text is fed to the word-embedding layers of the BERT model. We use the `bert-base-multilingual-uncased` and `openai/clip-vit-base-patch32` checkpoints for the BERT and CLIP Vision models, respectively. All our code and hyperparameters are available on [GitHub](https://github.com/gchhablani/multilingual-vqa).
+
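As a rough illustration of the components described in that paragraph (not the project's actual combined CLIP-Vision-BERT module, which lives in the linked GitHub repository), the two pretrained checkpoints can be loaded in Flax via transformers; the variable names below are placeholders:

from transformers import BertTokenizerFast, FlaxBertForMaskedLM, FlaxCLIPVisionModel

# Vision encoder that the images are fed through.
clip_vision = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Multilingual BERT with an MLM head; its word-embedding layers receive the text.
# The project's custom Flax model combines the CLIP Vision features with these
# BERT embeddings (see the repository for the exact wiring).
bert_mlm = FlaxBertForMaskedLM.from_pretrained("bert-base-multilingual-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")

The custom model in the repository wraps these pieces in a single Flax module, which is what allows it to initialize from both the BERT and CLIP Vision checkpoints at once.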