This demo uses a CLIP-Vision-Bert model checkpoint fine-tuned on a MarianMT-translated version of the VQA v2 dataset. Before fine-tuning, the model was pre-trained with a text-only Masked LM objective on approximately 10 million image-text pairs taken from the Conceptual 12M dataset and translated using MBart. The translated data covers four languages: English, French, German and Spanish.

The model predicts one of 3129 answer classes in English, which can be found here. The predicted answer is then shown in translation, according to the language chosen as Answer Language. The question, whether provided by the demo or written by you, can be in any of the following languages: English, French, German and Spanish.
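The prediction step described above can be sketched as a classification followed by a lookup. This is a minimal illustration, not the demo's actual code: `ANSWER_CLASSES`, `TRANSLATIONS`, and `predict_answer` are hypothetical names, and the four-entry vocabulary stands in for the real 3129 classes.

```python
# Hypothetical sketch: all names and data here are illustrative assumptions,
# not taken from the demo's source code.

# Toy stand-in for the 3129-entry English answer vocabulary.
ANSWER_CLASSES = ["yes", "no", "two", "red"]

# Toy translation table keyed by language code, then by English answer.
TRANSLATIONS = {
    "fr": {"yes": "oui", "no": "non", "two": "deux", "red": "rouge"},
    "de": {"yes": "ja", "no": "nein", "two": "zwei", "red": "rot"},
    "es": {"yes": "sí", "no": "no", "two": "dos", "red": "rojo"},
}


def predict_answer(logits, answer_language="en"):
    """Pick the highest-scoring English class, then translate it if requested."""
    if len(logits) != len(ANSWER_CLASSES):
        raise ValueError("expected one logit per answer class")
    # Argmax over the class scores selects the English answer.
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    english_answer = ANSWER_CLASSES[best_index]
    if answer_language == "en":
        return english_answer
    # Fall back to the English answer when no translation is available.
    return TRANSLATIONS.get(answer_language, {}).get(english_answer, english_answer)


logits = [0.1, 0.3, 2.5, -1.0]  # made-up scores for the four toy classes
print(predict_answer(logits, "en"))
print(predict_answer(logits, "de"))
```

Because the classifier only ever outputs an English class, the question language and the answer language are independent: the answer language affects only the final lookup.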

For more details, click on Usage above or Article on the sidebar. 🤗