This demo uses a CLIP-Vision-Bert model checkpoint fine-tuned on a MarianMT-translated version of the VQA v2 dataset. Before fine-tuning, the model was pre-trained with a text-only Masked LM objective on approximately 10 million image-text pairs taken from the Conceptual 12M dataset and translated using MBart. The translations cover the following four languages: English, French, German and Spanish.
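As a rough illustration of what "text-only Masked LM" means here, the sketch below masks some caption tokens and records the originals as prediction targets, while the paired image would be left untouched as context. The `[MASK]` string, the whitespace tokenization, the masking rates and the `mask_tokens` helper are all assumptions made for illustration; they are not taken from the actual training code.

```python
import random

MASK_TOKEN = "[MASK]"  # assumption: a BERT-style mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hypothetical helper: mask text tokens for a Masked LM objective.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # the model must predict this token
            labels.append(token)
        else:
            masked.append(token)       # unmasked tokens are not predicted
            labels.append(None)
    return masked, labels


# Only the caption is masked; the image features would be passed in unchanged.
caption = "a dog playing with a ball in the park".split()
masked_caption, labels = mask_tokens(caption, mask_prob=0.3, seed=1)
```

In a real setup the masking would operate on tokenizer IDs rather than whitespace-split words, and the loss would be computed only at the masked positions.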

The model predicts one of 3129 answer classes in English, the list of which can be found here. The translated version of the predicted answer is then shown based on the language chosen as Answer Language. The question can be written in any of the following languages: English, French, German and Spanish.
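The prediction step can be pictured as a classification followed by a lookup, as in the minimal sketch below. The three-entry `ANSWER_CLASSES` list, the `TRANSLATIONS` table and the `predict_answer` function are hypothetical stand-ins for the real 3129-class answer list and its translations; the demo's actual code may be organized differently.

```python
# Hypothetical stand-in for the 3129 English answer classes.
ANSWER_CLASSES = ["yes", "no", "dog"]

# Hypothetical translation table, keyed by answer language code.
TRANSLATIONS = {
    "fr": {"yes": "oui", "no": "non", "dog": "chien"},
    "de": {"yes": "ja", "no": "nein", "dog": "Hund"},
    "es": {"yes": "sí", "no": "no", "dog": "perro"},
}


def predict_answer(logits, answer_lang="en"):
    """Pick the highest-scoring English class, then translate it if needed."""
    # Index of the largest logit, i.e. the predicted class.
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    english_answer = ANSWER_CLASSES[best_index]
    if answer_lang == "en":
        return english_answer
    # Fall back to the English answer if no translation is available.
    return TRANSLATIONS[answer_lang].get(english_answer, english_answer)


# One score per answer class, as a classifier head would produce.
example_logits = [0.1, 0.2, 2.5]
answer_fr = predict_answer(example_logits, answer_lang="fr")
```

Because the model always classifies in English, the question language and the answer language are independent: a question asked in German can be answered in Spanish by changing only the lookup.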

For more details, click on Usage above or Article on the sidebar. 🤗