This model is a work in progress.

It is a vision encoder-decoder model fine-tuned from the following base models:
- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a DistilGPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
This model was trained on:
- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
- COCO 2017: https://cocodataset.org
You can find the code used to create the model here: https://github.com/mozilla/distilvit
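Since the model pairs a ViT image encoder with a DistilGPT-2 text decoder, it can be loaded for captioning through the standard `transformers` image-to-text pipeline. The sketch below is a minimal example, assuming `transformers`, `torch`, and `Pillow` are installed and the model is published on the Hub as `Mozilla/distilvit`; the image path is a placeholder.

```python
from transformers import pipeline


def generate_caption(image_path: str, model_id: str = "Mozilla/distilvit") -> str:
    """Caption a single image with the ViT encoder / DistilGPT-2 decoder model."""
    # The image-to-text pipeline wraps the vision encoder-decoder model,
    # its image processor, and its tokenizer in one call.
    captioner = pipeline("image-to-text", model=model_id)
    outputs = captioner(image_path)  # a list of {"generated_text": ...} dicts
    return outputs[0]["generated_text"]


if __name__ == "__main__":
    # Placeholder path; substitute any local image file.
    print(generate_caption("example.jpg"))
```

The first call downloads the model weights from the Hub; subsequent calls use the local cache.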
Evaluation results (self-reported, on nlphuji/flickr30k):
- ROUGE-1: 43.006
- ROUGE-2: 16.994
- ROUGE-L: 38.892
- ROUGE-LSUM: 38.888
- loss: 0.199
- gen_len: 11.327