This demo uses a ViTBert model checkpoint fine-tuned on a MarianMT-translated version of the VQA v2 dataset. Before fine-tuning, the model is pre-trained with a text-only Masked LM objective on approximately 10 million image-text pairs taken from the Conceptual 12M dataset and translated using MBart. The translated data covers four languages: English, French, German and Spanish.

The model predicts one of 3129 answer classes in English, which can be found here. The predicted answer is then shown in translation, according to the language chosen as Answer Language. The question can be in any of the following languages: English, French, German and Spanish.
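The answer step described above can be illustrated with a minimal sketch. It assumes the model outputs one score per answer class and that translated answers are stored in a lookup table keyed by the English answer. The names `ANSWER_VOCAB`, `ANSWER_TRANSLATIONS` and `predict_answer`, the language codes, and the three-entry vocabulary are all hypothetical stand-ins and are not taken from the demo's code.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for the real 3129-entry English answer vocabulary.
ANSWER_VOCAB: List[str] = ["yes", "no", "dog"]

# Hypothetical lookup table: English answer -> translation per language code.
ANSWER_TRANSLATIONS: Dict[str, Dict[str, str]] = {
    "yes": {"fr": "oui", "de": "ja", "es": "sí"},
    "no": {"fr": "non", "de": "nein", "es": "no"},
    "dog": {"fr": "chien", "de": "Hund", "es": "perro"},
}


def predict_answer(logits: Sequence[float], answer_language: str = "en") -> str:
    """Pick the highest-scoring class and return it in the requested language."""
    if len(logits) != len(ANSWER_VOCAB):
        raise ValueError("expected one score per answer class")

    # The classifier always predicts an English answer class.
    class_index = max(range(len(logits)), key=lambda i: logits[i])
    english_answer = ANSWER_VOCAB[class_index]

    # English needs no lookup; other languages use the translation table.
    if answer_language == "en":
        return english_answer
    return ANSWER_TRANSLATIONS[english_answer][answer_language]


if __name__ == "__main__":
    print(predict_answer([0.1, 0.2, 3.0], answer_language="de"))
```

Because the prediction is a class index rather than generated text, the question language and the answer language are independent in this sketch: the same index maps to an answer in any of the four languages.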

For more details, click on Usage or Article 🤗 below.