Commit 4871c6f by gchhablani
1 parent: 24bdb41
Update intro
Files changed: sections/intro/intro.md +1 -3
sections/intro/intro.md
CHANGED
@@ -11,6 +11,4 @@ A major **advantage that comes from using transformers is their simplicity and t
 
 While building a low-resource non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it will help create a generic approach/model that works decently well across several languages **With the aim of democratizing such a challenging yet interesting task, in this project, we focus on Mutilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP-Vision-BERT baseline which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports for four languages - **English, French, German and Spanish**.
 
-We follow the two-staged training approach, our pre-training task being text-only Masked Language Modeling (MLM). Our pre-training dataset comes from Conceptual-12M dataset where we use mBART-50 for translation. Our fine-tuning dataset is taken from the VQAv2 dataset and its translation is done using MarianMT models.
-
-Our checkpoints achieve a **validation accuracy of 0.69 on our MLM** task, while our fine-tuned model is able to achieve a **validation accuracy of 0.49 on our multilingual VQAv2 validation set**. With better captions, hyperparameter-tuning, and further training, we expect to see higher performance.
+We follow the two-staged training approach, our pre-training task being text-only Masked Language Modeling (MLM). Our pre-training dataset comes from Conceptual-12M dataset where we use mBART-50 for translation. Our fine-tuning dataset is taken from the VQAv2 dataset and its translation is done using MarianMT models. Our checkpoints achieve a **validation accuracy of 0.69 on our MLM** task, while our fine-tuned model is able to achieve a **validation accuracy of 0.49 on our multilingual VQAv2 validation set**. With better captions, hyperparameter-tuning, and further training, we expect to see higher performance.