gchhablani's picture
Fix image display issue.
547e7ab

Abstract

This project is focused on Spanish Image Captioning. Most of the existing datasets and models on this task work with English-only image-text pairs. Our intention here is to show that CLIP Vision + Marian model can be trained on Spanish translation textual checkpoints with pre-trained image encoders and made to perform well enough on this particular task.

Due to lack of good-quality Spanish data, we translate subsets of the Conceptual 12M dataset into Spanish using the Marian MT Helsinki-NLP/opus-mt-en-es model. With better translated captions, and hyperparameter-tuning, we expect to see higher performance.