## Challenges and Technical Difficulties

We faced challenges at every step of the way, despite the 🤗 team having some example scripts and models ready in Flax.

- The dataset we used, Conceptual 12M, took 2-3 days to translate with mBART (we didn't have Marian at the time). The major bottleneck was implementing the translation efficiently: we first tried `mtranslate`, but it turned out to be too slow even with multiprocessing. A batched-inference sketch follows this list.
- Translations produced by deep learning models aren't as "perfect" as those from translation APIs like Google and Yandex, which could lead to poor downstream performance.
- We prepared the model and config classes for our model from scratch, basing them on the `CLIP Vision` and `mBART` implementations in Flax. The major challenge was feeding the ViT embeddings into the BERT embeddings class; see the simplified sketch after this list.
- Because of the above-mentioned challenges, we were only able to get around 1.5 days of training time on TPUs and could not perform hyperparameter tuning. Our [loss curves on the pre-training model](https://huggingface.co/flax-community/spanish-image-captioning/tensorboard) show that the training hasn't converged, so we could see further improvement in the BLEU scores with longer training.
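
For reference, the following is a minimal sketch of the kind of batched mBART translation pipeline we mean, written against the publicly available mBART-50 many-to-many checkpoint. It is illustrative only; our actual script, batch size, and checkpoint may have differed:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# mBART-50 many-to-many checkpoint (assumed here for illustration)
CHECKPOINT = "facebook/mbart-large-50-many-to-many-mmt"

model = MBartForConditionalGeneration.from_pretrained(CHECKPOINT)
tokenizer = MBart50TokenizerFast.from_pretrained(CHECKPOINT)
tokenizer.src_lang = "en_XX"  # source captions are in English

def translate_batch(captions, batch_size=64):
    """Translate a list of English captions to Spanish in batches."""
    translations = []
    for i in range(0, len(captions), batch_size):
        batch = tokenizer(
            captions[i : i + batch_size],
            return_tensors="pt",
            padding=True,
            truncation=True,
        )
        # Force the decoder to start generating in Spanish
        generated = model.generate(
            **batch,
            forced_bos_token_id=tokenizer.lang_code_to_id["es_XX"],
        )
        translations.extend(
            tokenizer.batch_decode(generated, skip_special_tokens=True)
        )
    return translations

print(translate_batch(["A dog playing in the park."]))
```

Batching the captions through the model like this, rather than calling a translation API one caption at a time as `mtranslate` does, is what makes translating millions of captions tractable at all.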
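And here is a heavily simplified Flax sketch of the embeddings idea: project the CLIP-ViT patch embeddings into the text embedding space and prepend them to the token embeddings, where a BERT-style embeddings class would normally handle only tokens. All names are hypothetical, not the actual flax-community implementation, and the real class also deals with position embeddings, layer norm, and dropout:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class HybridEmbeddings(nn.Module):
    """Illustrative sketch of combining visual and textual embeddings."""
    hidden_size: int = 768
    vocab_size: int = 250027  # illustrative; an mBART-style vocabulary

    @nn.compact
    def __call__(self, input_ids, visual_embeds):
        # visual_embeds: (batch, num_patches, clip_hidden) from CLIP-ViT.
        # Project the image features into the text embedding space.
        visual_proj = nn.Dense(self.hidden_size, name="visual_projection")(visual_embeds)
        # Standard token embeddings, as in a BERT-style embeddings class.
        token_embeds = nn.Embed(self.vocab_size, self.hidden_size,
                                name="word_embeddings")(input_ids)
        # Concatenate along the sequence axis: [image patches; text tokens].
        return jnp.concatenate([visual_proj, token_embeds], axis=1)

# Dummy usage: 16 token ids and 50 CLIP patch vectors per example.
module = HybridEmbeddings()
ids = jnp.ones((2, 16), dtype=jnp.int32)
vis = jnp.ones((2, 50, 512))
params = module.init(jax.random.PRNGKey(0), ids, vis)
out = module.apply(params, ids, vis)  # shape: (2, 66, 768)
```

Getting this kind of fusion to play nicely with the pretrained weights and shape expectations of the existing Flax classes was the hard part, since the downstream encoder layers had to consume the lengthened sequence unchanged.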