
Challenges and Technical Difficulties

Training an image captioning model, and a multilingual one at that, was a daunting task, and we faced challenges at almost every stage of the process.

Dataset- Our initial plan was to translate ConceptualCaptions 12M using mTranslate or Yandex, but both turned out to be too slow, even with multiprocessing. Without proper translations, the trained image captioning model could perform poorly. We therefore translated the whole dataset with mBART50 for all languages, which took around 3-4 days. Ideally, we would have used a dedicated translation model for each language, but no such models were available at the time (Marian models are now available for this).
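
The batched mBART50 translation described above could look roughly like the following. This is a minimal sketch, not the script we actually ran: the checkpoint name, the target language codes in `TARGET_LANGS`, the batch size, and the helper names `batched`, `load_translator`, and `translate_captions` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List

# Assumed target languages and their mBART50 codes (not taken from the write-up).
TARGET_LANGS = {"fr": "fr_XX", "de": "de_DE", "es": "es_XX"}


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size captions."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_translator(checkpoint: str = "facebook/mbart-large-50-one-to-many-mmt"):
    """Load an English-to-many mBART50 model (assumed checkpoint).

    transformers is imported lazily so the helpers work without it installed.
    """
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint, src_lang="en_XX")
    model = MBartForConditionalGeneration.from_pretrained(checkpoint)
    return model, tokenizer


def translate_captions(captions, lang, model, tokenizer, batch_size=32):
    """Translate English captions into lang, one batch at a time."""
    code = TARGET_LANGS[lang]
    for batch in batched(captions, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        # Forcing the first generated token to the language code selects the target language.
        generated = model.generate(
            **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids(code)
        )
        yield from tokenizer.batch_decode(generated, skip_special_tokens=True)
```

A call would look like `model, tokenizer = load_translator()` followed by `list(translate_captions(captions, "fr", model, tokenizer))`. Because one multilingual model serves every target language, the same loop can be repeated per language, at the cost of translation quality compared with a dedicated per-language model.
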
We wrote the model and config classes from scratch, basing them on the Flax implementations of CLIP (with the ViT-B/32 image transformer) and mBART50. The major challenge here was using the CLIP embeddings inside the mBART50 embeddings class.
RAM issues- Loading and training on the 10M image-caption dataset consumed a huge amount of RAM on the TPU (~200GB in the first few steps). To avoid this, we had to optimize the script, use less data, and use fewer num_workers.
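
One common way to keep memory bounded in this situation is to stream the caption file lazily and cap the number of examples, instead of loading everything up front. The sketch below illustrates that idea only; it is not our training script. The tab-separated `image_path<TAB>caption` layout and the names `iter_caption_rows`, `iter_batches`, and `max_examples` are assumptions.

```python
import csv
from typing import Iterator, List, Optional, Tuple

Pair = Tuple[str, str]


def iter_caption_rows(tsv_path: str) -> Iterator[Pair]:
    """Yield (image_path, caption) pairs one at a time from a TSV file."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside captions as literal text.
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:  # skip blank or malformed lines
                yield row[0], row[1]


def iter_batches(
    tsv_path: str, batch_size: int, max_examples: Optional[int] = None
) -> Iterator[List[Pair]]:
    """Yield batches of pairs, optionally stopping after max_examples pairs."""
    batch: List[Pair] = []
    for index, pair in enumerate(iter_caption_rows(tsv_path)):
        if max_examples is not None and index >= max_examples:
            break
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Only one batch of pairs is held in memory at a time, and `max_examples` corresponds to the "use less data" option. Images would still need to be read and decoded per batch, and with a multi-worker loader each worker typically keeps its own buffers, which is why reducing the worker count also lowers peak RAM.
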
Because of the aforementioned challenges, we were only able to get around 2 days of training time on TPUs, and we were unable to perform hyperparameter tuning. The loss curves from pre-training show that training has not converged, so the loss could still improve with further training.