Abstract

This project focuses on Multilingual Image Captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept: our CLIP Vision + mBART-50 model, which pairs a pre-trained image encoder with multilingual text checkpoints, can be trained to perform reasonably well on this task.
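As a rough illustration of the two pre-trained components mentioned above, the following is a minimal sketch of loading the publicly available CLIP vision and mBART-50 checkpoints from the Hugging Face Hub. The checkpoint names are assumptions based on the standard releases, and the project's actual encoder-decoder wiring (how the vision features are fed into the mBART decoder) lives in its own modeling code and is not reproduced here.

```python
from transformers import (
    CLIPVisionModel,
    CLIPImageProcessor,
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
)

# Assumed standard checkpoints; the project combines these two components
# into a single image-to-multilingual-text captioning model.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

text_decoder = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
```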

Due to the lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into French, German, and Spanish (the English subset needs no translation) using the MarianMT model for each respective language. With better-translated captions and hyperparameter tuning, we expect to see higher performance.
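Below is a minimal sketch of how captions could be machine-translated with a MarianMT checkpoint via the `transformers` library. The `Helsinki-NLP/opus-mt-en-fr` model name and the sample captions are illustrative assumptions; the same pattern would apply with the corresponding en-de and en-es checkpoints for German and Spanish.

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed English-to-French MarianMT checkpoint for illustration.
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Hypothetical English captions standing in for a Conceptual 12M subset.
captions = ["a dog playing in the park", "a plate of fresh fruit on a table"]

# Tokenize, translate, and decode back to text.
batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```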