sections/pretraining.md · flax-community/Multilingual-VQA at a4ce24ce59288673bc6a014b1642f194e3cf7ead

Pretraining

We follow an approach similar to VisualBERT. Instead of using a FasterRCNN to get image features, we use a CLIP Vision (ViT transformer) encoder. The pre-training task is text-only MLM (Masked Language Modeling). We mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task where two captions are matched against an image, but we skip this for the sake of simplicity.

Dataset

The dataset we use for pre-training is a cleaned version of Conceptual 12M. The dataset is downloaded and then broken images are removed which gives us about 10M images. Then we use the MBart50 mbart-large-50-one-to-many-mmt checkpoint to translate the dataset into four different languages - English, French, German, and Spanish, keeping 2.5 million examples of each language.

Model

The model is shown in the image above. The Dummy MLM Head is actually combined with the MLM head but it never contributes to the MLM loss, hence the name (the predictions on these tokens are ignored). We create a custom model in Flax which integerates the CLIP Vision model inside BERT embeddings. We also use custom configs and modules in order to accomodate for these changes, and allow loading from BERT and CLIP Vision checkpoints. The image is fed to the CLIP Vision encoder and the text is fed to the word-embedding layers of BERT model. We use the bert-base-multilingual-uncased and openai/clip-vit-base-patch32 checkpoints for BERT and CLIP Vision models, respectively. All our code is available on GitHub.