Abstract

This project focuses on Spanish image captioning. Most existing datasets and models for this task work with English-only image-text pairs. Our goal is to show that a CLIP Vision + Marian model, built from a pre-trained image encoder and a pre-trained Spanish translation text checkpoint, can be trained to perform well on this task.

Due to the lack of good-quality Spanish data, we translate subsets of the Conceptual 12M dataset into Spanish using the Marian MT Helsinki-NLP/opus-mt-en-es model. With better translated captions and hyperparameter tuning, we expect to see higher performance.
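
The caption translation step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual preprocessing script: it assumes the PyTorch `MarianMTModel` and `MarianTokenizer` classes from the `transformers` library, and the helper names (`batched`, `load_marian_translator`, `translate_captions`) and the batch size are hypothetical choices made for this example.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-en-es",
) -> Callable[[List[str]], List[str]]:
    """Build a batch translator from a Marian checkpoint.

    Assumption: the PyTorch backend is used here. Weights are downloaded
    the first time the checkpoint is loaded.
    """
    # Imported lazily so the batching helpers work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(texts: List[str]) -> List[str]:
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate


def translate_captions(
    captions: Iterable[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,  # hypothetical default, not taken from the project
) -> List[str]:
    """Translate captions batch by batch, preserving their order."""
    translated: List[str] = []
    for batch in batched(captions, batch_size):
        translated.extend(translate_batch(batch))
    return translated
```

A call such as `translate_captions(english_captions, load_marian_translator())` would then return the Spanish captions in the same order as the inputs. Passing the translator in as a function keeps the batching logic independent of the model, so it can be tried with a stand-in function before loading the checkpoint.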