
This demo uses a CLIP-Vision-Bert model checkpoint pre-trained with text-only Masked LM on approximately 10 million image-text pairs. The pairs are taken from the Conceptual 12M dataset, and the captions cover four languages (English, French, German and Spanish), with translations produced using MBart. This gives 2.5M examples in each language.

The model can be used for mask-filling, as shown in this demo. The caption can be the one already present or one you write yourself, in any of the following languages: English, French, German and Spanish.
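To make the mask-filling input concrete, here is a minimal sketch of how a caption could be turned into a masked prompt before being passed to the model. It assumes the BERT-style `[MASK]` placeholder and masks whole whitespace-separated words, whereas a real tokenizer works on subword tokens. The helper name `mask_caption` is hypothetical and is not part of the demo's code.

```python
import random
from typing import Optional, Tuple

# Assumption: the checkpoint follows the BERT convention for the mask placeholder.
MASK_TOKEN = "[MASK]"


def mask_caption(
    caption: str, index: Optional[int] = None, seed: Optional[int] = None
) -> Tuple[str, str]:
    """Replace one whitespace-separated word with MASK_TOKEN.

    Returns the masked caption and the word that was hidden.
    If `index` is None, a word position is picked at random (seeded by `seed`).
    """
    words = caption.split()
    if not words:
        raise ValueError("caption must contain at least one word")
    if index is None:
        index = random.Random(seed).randrange(len(words))
    hidden = words[index]
    words[index] = MASK_TOKEN
    return " ".join(words), hidden


# Works the same way for any of the four supported languages.
masked, answer = mask_caption("Un chien court sur la plage", index=1)
print(masked)  # Un [MASK] court sur la plage
print(answer)  # chien
```

The model's task is then to predict the hidden word from the masked caption together with the image.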

For more details, click on Usage above or Article on the sidebar. 🤗