We faced challenges at every step of the way, despite the 🤗 team having some example scripts and models ready in Flax:

- The dataset we used, Conceptual 12M, took 2-3 days to translate using MBart (since we didn't have Marian at the time). The major bottleneck was implementing the translation efficiently. We tried `mtranslate` first, but it turned out to be too slow, even with multiprocessing (a sketch of the batched MBart approach is shown after this list).
- Translations from deep learning models aren't as "perfect" as those from translation APIs like Google and Yandex, which could lead to poorer downstream performance.
- We prepared the model and config classes for our model from scratch, basing them on the `CLIP Vision` and `BERT` implementations in Flax. The major challenge was using the ViT embeddings inside the BERT embeddings class (see the embedding sketch after this list).
- We prepared a training script for image-text MLM and sequence classification, which we based on the hybrid CLIP, masked LM, and text classification examples.
- Due to the above-mentioned challenges, we were only able to get around 1.5 days of training time on TPUs and were unable to perform hyperparameter tuning. Our [loss curves on the pre-training](https://huggingface.co/flax-community/multilingual-vqa/tensorboard) show that the training hasn't converged, and we could see further improvement in the MLM accuracy.
- The VQA dataset, despite having many examples (and 4x as many after translation), is still small, and the model overfits. To address this, we need more multilingual data and lighter models, both of which are major challenges right now.
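
To illustrate the translation step, here is a minimal sketch of batched MBart translation with 🤗 Transformers. The checkpoint (`facebook/mbart-large-50-one-to-many-mmt`), target-language codes, and generation settings below are assumptions for the sketch, not necessarily our exact setup:

```python
# Hedged sketch: batch-translate English captions with an MBart-50 checkpoint.
# The checkpoint, target languages, and max_length are illustrative assumptions.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

checkpoint = "facebook/mbart-large-50-one-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint, src_lang="en_XX")


def translate_batch(captions, target_lang="fr_XX", max_length=64):
    """Translate a list of English captions into one target language."""
    inputs = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[target_lang],
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# e.g. translate_batch(["a dog playing in the snow"], target_lang="de_DE")
```

Even with batching, running a sequence-to-sequence model over ~12M captions is what made the translation take 2-3 days.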
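
The embedding fusion can be sketched as follows, assuming the CLIP-ViT patch embeddings are projected to BERT's hidden size and concatenated with the text embeddings before the BERT encoder runs. The class and field names here are illustrative and do not mirror our actual implementation:

```python
# Minimal Flax sketch of fusing CLIP-ViT patch embeddings into BERT embeddings.
# Names and default sizes are hypothetical; this is not our exact module.
import jax.numpy as jnp
import flax.linen as nn


class CLIPVisionBertEmbeddings(nn.Module):
    hidden_size: int = 768
    vocab_size: int = 30522
    max_position_embeddings: int = 512
    type_vocab_size: int = 2

    @nn.compact
    def __call__(self, input_ids, token_type_ids, position_ids, visual_embeds):
        # Standard BERT text embeddings: word + position + token-type.
        words = nn.Embed(self.vocab_size, self.hidden_size)(input_ids)
        positions = nn.Embed(self.max_position_embeddings, self.hidden_size)(position_ids)
        types = nn.Embed(self.type_vocab_size, self.hidden_size)(token_type_ids)
        text_embeds = words + positions + types

        # Project CLIP-ViT patch embeddings into BERT's hidden space and give
        # them their own (visual) token-type embedding.
        visual = nn.Dense(self.hidden_size)(visual_embeds)
        visual_types = nn.Embed(self.type_vocab_size, self.hidden_size)(
            jnp.ones(visual_embeds.shape[:-1], dtype=jnp.int32)
        )
        visual_embeds = visual + visual_types

        # Concatenate along the sequence axis: [visual patches ; text tokens],
        # so the BERT encoder attends over both modalities jointly.
        embeddings = jnp.concatenate([visual_embeds, text_embeds], axis=1)
        return nn.LayerNorm()(embeddings)
```

The resulting sequence is then fed to an otherwise unchanged BERT encoder, which is why the change had to live inside the embeddings class rather than the encoder itself.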