Multilingual-VQA / sections /vqa_intro.md
gchhablani's picture
Fix app and itnro
70de33c
|
raw
history blame
No virus
1.15 kB
This demo uses a [CLIP-Vision-Bert model checkpoint](https://huggingface.co/flax-community/clip-vision-bert-vqa-ft-6k) fine-tuned on a [MarianMT](https://huggingface.co/transformers/model_doc/marian.html)-translated version of the [VQA v2 dataset](https://visualqa.org/challenge.html). The fine-tuning is performed after pre-training using text-only Masked LM on approximately 10 million image-text pairs taken from the [Conceptual 12M dataset](https://github.com/google-research-datasets/conceptual-12m) translated using [MBart](https://huggingface.co/transformers/model_doc/mbart.html). The translations are performed in the following four languages: English, French, German and Spanish.
The model predicts one out of 3129 classes in English which can be found [here](https://huggingface.co/spaces/flax-community/Multilingual-VQA/blob/main/answer_reverse_mapping.json), and then the translated versions are provided based on the language chosen as `Answer Language`. The question can be present or written in any of the following: English, French, German and Spanish.
For more details, click on `Usage` above or `Article` on the sidebar. πŸ€—