We follow an approach similar to VisualBERT. Instead of using a Faster R-CNN to extract region-based image features, we use a CLIP Vision (ViT) encoder. The pre-training task is text-only MLM (Masked Language Modeling): we mask only the text tokens and predict the masked tokens. The VisualBERT authors also use a sentence-image matching task, in which two captions are matched against an image, but we skip this for the sake of simplicity. A minimal sketch of this setup is shown below.
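
The sketch below illustrates the idea (it is not the exact training code of this repo): CLIP-ViT patch features are prepended to the text embeddings of a BERT MLM model, and the MLM loss is computed only on masked text positions. The checkpoint names, the 15% masking rate, and the simplified masking (no 80/10/10 split) are assumptions made for illustration.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM, CLIPVisionModel

# Assumed checkpoints; both have hidden size 768, so no projection layer is needed here.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertForMaskedLM.from_pretrained("bert-base-uncased")
clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")


def mask_text_tokens(input_ids, mlm_probability=0.15):
    """Standard BERT-style masking applied to text tokens only; image features are never masked."""
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_probability)
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()],
        dtype=torch.bool,
    )
    prob.masked_fill_(special, 0.0)               # never mask [CLS], [SEP], [PAD]
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                        # loss is computed only on masked positions
    masked_input_ids = input_ids.clone()
    masked_input_ids[masked] = tokenizer.mask_token_id  # 80/10/10 replacement omitted for brevity
    return masked_input_ids, labels


def mlm_step(pixel_values, captions):
    # Patch embeddings from the CLIP Vision (ViT) encoder: (B, 50, 768) for ViT-B/32.
    visual_embeds = clip_vision(pixel_values=pixel_values).last_hidden_state

    enc = tokenizer(captions, padding=True, return_tensors="pt")
    input_ids, labels = mask_text_tokens(enc["input_ids"])

    # Embed the (masked) text tokens and prepend the visual features, VisualBERT-style.
    text_embeds = bert.bert.embeddings.word_embeddings(input_ids)
    inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)

    # Visual positions get label -100 so they contribute nothing to the MLM loss.
    visual_labels = torch.full(visual_embeds.shape[:2], -100, dtype=torch.long)
    attention_mask = torch.cat(
        [torch.ones(visual_embeds.shape[:2], dtype=torch.long), enc["attention_mask"]], dim=1)

    out = bert(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        labels=torch.cat([visual_labels, labels], dim=1),
    )
    return out.loss
```

In this toy version, the loss returned by `mlm_step` is back-propagated through both the BERT layers and (optionally) the CLIP vision encoder; whether the vision tower is kept frozen or fine-tuned is a training choice not fixed by the description above.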