This demo uses a CLIP-Vision-Bert model checkpoint fine-tuned on a MarianMT-translated version of the VQA v2 dataset. Before fine-tuning, the model was pre-trained with a text-only Masked LM objective on approximately 10 million image-text pairs taken from the Conceptual 12M dataset and translated using MBart. Both datasets cover the following four languages: English, French, German and Spanish.

The model predicts one of 3129 answer classes in English, which can be found here, and the predicted answer is then shown translated into the language chosen as Answer Language. The question can be written in any of the following languages: English, French, German and Spanish.
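The answer step described above can be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: the function name `predict_answer`, the `translations` lookup table, and the toy three-class vocabulary (standing in for the 3129 real classes) are all assumptions made for the example.

```python
from typing import Dict, Sequence


def predict_answer(
    logits: Sequence[float],
    english_answers: Sequence[str],
    translations: Dict[str, Dict[str, str]],
    answer_lang: str,
) -> str:
    """Pick the highest-scoring class and return it in the requested language.

    All names here are hypothetical; the real demo may organise this differently.
    """
    if len(logits) != len(english_answers):
        raise ValueError("expected one logit per answer class")

    # Classification: the index of the largest logit is the predicted class.
    best = max(range(len(logits)), key=lambda i: logits[i])
    english = english_answers[best]

    # English needs no lookup; other languages use a precomputed table and
    # fall back to the English answer if no translation is available.
    if answer_lang == "en":
        return english
    return translations.get(answer_lang, {}).get(english, english)


# Toy data: three classes instead of 3129 (assumed values for illustration).
answers = ["yes", "no", "two"]
table = {"fr": {"yes": "oui", "no": "non", "two": "deux"}}
print(predict_answer([0.1, 2.0, -1.0], answers, table, "fr"))
```

Keeping the classifier output in English and translating afterwards means a single answer vocabulary serves all four languages, which matches the behaviour described above.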

For more details, click on Usage or Article 🤗 above.