---
tags:
- image-to-text
- image-captioning
license: apache-2.0
metrics:
- rouge
datasets:
- nlphuji/flickr30k
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model:
- google/vit-base-patch16-224-in21k
model-index:
- name: mozilla/distilvit
  results:
  - task:
      type: image-to-text
      name: Image To Text
    dataset:
      name: nlphuji/flickr30k
      type: nlphuji/flickr30k
    metrics:
    - name: ROUGE-1
      type: rouge
      value: 43.006
      verified: true
    - name: ROUGE-2
      type: rouge
      value: 16.9939
      verified: true
    - name: ROUGE-L
      type: rouge
      value: 38.8923
      verified: true
    - name: ROUGE-LSUM
      type: rouge
      value: 38.8877
      verified: true
    - name: loss
      type: loss
      value: 0.19939416646957397
    - name: gen_len
      type: gen_len
      value: 11.327256736227712
      verified: true
---

# distilvit

This model is a work in progress. It is a fine-tuned version of these base models:

- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2

The model was first trained on:

- Flickr30k: https://huggingface.co/datasets/nlphuji/flickr30k
- COCO 2017: https://cocodataset.org

You can get that checkpoint using the 3083a3cef6e3c8dd90df3f088074bbe836b0f403 commit.

It was then further fine-tuned on:

- Flickr30k debiased: https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions
- DocOrNot: https://huggingface.co/datasets/Mozilla/docornot

You can find the code used to create the model here: https://github.com/mozilla/distilvit

### Framework versions

- Transformers 4.40.2
- PyTorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
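### Usage

A minimal captioning sketch using the `transformers` `image-to-text` pipeline. It assumes the checkpoint is published on the Hugging Face Hub under the `mozilla/distilvit` name shown in the model-index above; adjust the identifier if your copy lives elsewhere.

```python
from transformers import pipeline

# Assumption: the checkpoint is available on the Hub as "mozilla/distilvit".
captioner = pipeline("image-to-text", model="mozilla/distilvit")

# Any local path or URL accepted by the pipeline works here; this sample
# image is one of the widget examples from this card.
result = captioner(
    "https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg"
)
print(result[0]["generated_text"])
```

The pipeline returns a list with one dictionary per generated caption; `generated_text` holds the decoded string.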
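For more control over preprocessing and generation, the model can also be loaded at a lower level. This sketch assumes the checkpoint follows the standard `VisionEncoderDecoderModel` layout implied by the ViT encoder and DistilGPT-2 decoder listed above, and again uses the `mozilla/distilvit` identifier.

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

# Assumption: the Hub repo bundles the model, image processor, and tokenizer,
# as is conventional for VisionEncoderDecoder checkpoints.
model = VisionEncoderDecoderModel.from_pretrained("mozilla/distilvit")
image_processor = AutoImageProcessor.from_pretrained("mozilla/distilvit")
tokenizer = AutoTokenizer.from_pretrained("mozilla/distilvit")

url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The ViT encoder consumes fixed-size pixel patches; the processor handles
# resizing and normalization. The GPT-2 decoder then generates the caption.
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```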