A major **advantage of using transformers is their simplicity and accessibility**, thanks to HuggingFace's `transformers` library and models such as ViT. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
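As a minimal sketch of this simplicity (not the project's own code; it assumes the publicly available `google/vit-base-patch16-224-in21k` checkpoint, a PyTorch backend, and a hypothetical local image file), normalizing an image and passing it through a ViT encoder looks roughly like this:

```python
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

# Hypothetical example image; any RGB image works here.
image = Image.open("example.jpg")

# Assumed checkpoint for illustration; the project itself uses a CLIP vision encoder.
checkpoint = "google/vit-base-patch16-224-in21k"
feature_extractor = ViTFeatureExtractor.from_pretrained(checkpoint)
model = ViTModel.from_pretrained(checkpoint)

# The feature extractor resizes and normalizes the image into pixel values.
inputs = feature_extractor(images=image, return_tensors="pt")

# A single forward pass gives patch embeddings: (1, 197, 768) = [CLS] + 14x14 patches.
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```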
While building a low-resource, non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it helps create a generic model that works well across several languages, which can then be fine-tuned in low-resource settings to leverage the pre-training improvements. **With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP-Vision-BERT baseline, which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports four languages - **English, French, German and Spanish**.
We follow a two-stage training approach, with text-only Masked Language Modeling (MLM) as the pre-training task. Our pre-training data comes from the Conceptual-12M dataset, whose captions we translate using mBART-50. Our fine-tuning dataset is taken from the VQAv2 dataset and translated using MarianMT models. Our checkpoints achieve a **validation accuracy of 0.69 on the MLM** task, while our fine-tuned model achieves a **validation accuracy of 0.49 on our multilingual VQAv2 validation set**. With better captions, hyperparameter tuning, and further training, we expect to see higher performance.
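For a rough idea of how the MarianMT translation step can look (a sketch only; `Helsinki-NLP/opus-mt-en-fr` is an assumed example checkpoint, and the exact models and batching used in the project may differ), an English VQAv2 question can be translated into French as follows:

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed English-to-French MarianMT checkpoint for illustration.
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Example VQA-style question (hypothetical).
questions = ["What color is the dog?"]

# Tokenize, generate the translation, and decode back to text.
batch = tokenizer(questions, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
# e.g. ["De quelle couleur est le chien ?"]
```

The same pattern applies per target language by swapping in the corresponding `opus-mt-en-*` checkpoint.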