This model is a work in progress.

It is a vision encoder-decoder model fine-tuned from the following base models:
- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a DistilGPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
This model was trained on:
- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
- COCO 2017: https://cocodataset.org
You can find the code used to create the model here: https://github.com/mozilla/distilvit
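Since the model pairs a ViT image encoder with a DistilGPT-2 text decoder, it can be loaded for captioning through the standard `transformers` image-to-text pipeline. The sketch below is a minimal example, assuming `transformers`, `torch`, and `Pillow` are installed and the model is published on the Hub as `Mozilla/distilvit`; the image path is a placeholder.

```python
from transformers import pipeline


def generate_caption(image_path: str, model_id: str = "Mozilla/distilvit") -> str:
    """Caption a single image with the ViT encoder / DistilGPT-2 decoder model."""
    # The image-to-text pipeline wraps the vision encoder-decoder model,
    # its image processor, and its tokenizer in one call.
    captioner = pipeline("image-to-text", model=model_id)
    outputs = captioner(image_path)  # a list of {"generated_text": ...} dicts
    return outputs[0]["generated_text"]


if __name__ == "__main__":
    # Placeholder path; substitute any local image file.
    print(generate_caption("example.jpg"))
```

The first call downloads the model weights from the Hub; subsequent calls use the local cache.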
Evaluation results (self-reported, on nlphuji/flickr30k):
- ROUGE-1: 43.006
- ROUGE-2: 16.994
- ROUGE-L: 38.892
- ROUGE-LSUM: 38.888
- loss: 0.199
- gen_len: 11.327