sections/abstract.md · flax-community/multilingual-image-captioning at 306db2e27999d20bc62129a24d783a2242870960

Abstract

This project is focused on Mutilingual Image Captioning. Most of the existing datasets and models on this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept with our CLIP Vision + mBART-50 model can be trained on multilingual textual checkpoints with pre-trained image encoders and made to perform well enough.

Due to lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (no translation needed), French, German and Spanish using the MarianMT model belonging to the respective language. With better translated captions, and hyperparameter-tuning, we expect to see higher performance.