
Challenges and Technical Difficulties

We faced challenges at every step of the way, even though the 🤗 team had some example scripts and models ready in Flax.

  • The dataset we used, Conceptual 12M, took 2-3 days to translate using mBART (Marian was not available in Flax at the time). The major bottleneck was implementing the translation efficiently: we tried mtranslate first, but it turned out to be too slow, even with multiprocessing.

  • Translations from deep learning models aren't as accurate as those from translation APIs like Google and Yandex, which could lead to poorer downstream performance.

  • We prepared the model and config classes for our model from scratch, basing them on the CLIP Vision and mBART implementations in Flax. The major challenge was getting the ViT embeddings to work inside the BERT embeddings class.

  • Because of the above-mentioned challenges, we were only able to get around 1.5 days of training time on TPUs, and we were unable to perform hyperparameter tuning. Our loss curves for the pre-trained model show that training has not converged, so we expect further training would improve the BLEU scores.
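The sharded, multiprocessed translation setup described in the first bullet can be sketched as follows. This is an illustrative skeleton, not our actual script: `translate_batch` is a stub standing in for an mBART `generate` call (or an mtranslate request), and the shard size and worker count are arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor

def translate_batch(captions, tgt_lang):
    # Stand-in for a real translation call, e.g. an mBART forward pass:
    #   model.generate(**tokenizer(captions, ...), forced_bos_token_id=...)
    # Here we just tag each caption so the pipeline is runnable.
    return [f"[{tgt_lang}] {c}" for c in captions]

def translate_shard(args):
    # One worker translates one shard of captions.
    shard, tgt_lang = args
    return translate_batch(shard, tgt_lang)

def translate_dataset(captions, tgt_lang, shard_size=256, workers=4):
    # Split the caption list into shards and translate them in parallel,
    # preserving the original order of the dataset.
    shards = [captions[i:i + shard_size]
              for i in range(0, len(captions), shard_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(translate_shard, [(s, tgt_lang) for s in shards])
    return [caption for shard in results for caption in shard]
```

Even with this kind of parallelism, per-caption API calls were far slower than batched model inference, which is why we moved to running mBART directly over batches on accelerators.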
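The embeddings challenge from the third bullet can be sketched in a minimal, framework-agnostic way: the ViT patch embeddings from the vision encoder are projected into the text model's hidden size and prepended to the token embeddings, so the rest of the text stack sees one fused sequence. NumPy stands in for Flax modules here, the shapes are hypothetical, and the projection matrix is random rather than learned.

```python
import numpy as np

def multimodal_embeddings(patch_embeds, token_embeds, visual_proj):
    """Prepend projected visual embeddings to the text token embeddings.

    patch_embeds: (batch, num_patches, vision_dim)  -- from the vision encoder
    token_embeds: (batch, seq_len, hidden_dim)      -- from the word embeddings
    visual_proj:  (vision_dim, hidden_dim)          -- projection (random here)
    """
    # Project patch embeddings into the text model's hidden size.
    projected = patch_embeds @ visual_proj  # (batch, num_patches, hidden_dim)
    # Concatenate along the sequence axis: visual tokens first, then text.
    return np.concatenate([projected, token_embeds], axis=1)

# Hypothetical shapes for illustration only.
batch, num_patches, vision_dim = 2, 50, 768
seq_len, hidden_dim = 16, 1024
rng = np.random.default_rng(0)
patches = rng.normal(size=(batch, num_patches, vision_dim))
tokens = rng.normal(size=(batch, seq_len, hidden_dim))
proj = rng.normal(size=(vision_dim, hidden_dim))
fused = multimodal_embeddings(patches, tokens, proj)  # (2, 66, 1024)
```

In the actual Flax module, position and token-type embeddings, layer norm, and dropout also have to be applied consistently across the fused sequence, which is what made wiring this into the existing embeddings class tricky.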