Mozilla
/

distilvit

vision-encoder-decoder

image-captioning

Inference Endpoints

Model card Files Files and versions Community

distilvit / README.md

tarekziade's picture

Upload README.md with huggingface_hub

2932b88 verified about 16 hours ago

|

raw history blame contribute delete

No virus

2.36 kB

	---
	tags:
	- image-to-text
	- image-captioning
	license: apache-2.0
	metrics:
	- rouge
	datasets:
	- nlphuji/flickr30k
	widget:
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
	example_title: Savanna
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
	example_title: Football Match
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
	example_title: Airport
	base_model:
	- google/vit-base-patch16-224-in21k

	model-index:
	- name: mozilla/distilvit
	results:
	- task:
	type: image-to-text
	name: Image To Text
	dataset:
	name: nlphuji/flickr30k
	type: nlphuji/flickr30k
	metrics:
	- name: ROUGE-1
	type: rouge
	value: 43.006
	verified: true
	- name: ROUGE-2
	type: rouge
	value: 16.9939
	verified: true
	- name: ROUGE-L
	type: rouge
	value: 38.8923
	verified: true
	- name: ROUGE-LSUM
	type: rouge
	value: 38.8877
	verified: true
	- name: loss
	type: loss
	value: 0.19939416646957397
	- name: gen_len
	type: gen_len
	value: 11.327256736227712
	verified: true
	---

	# distilvit

	This model is a work in progress. Fine-tuned version of those base models:

	- a VIT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
	- a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2

	This model was trained on:

	- Flickr30k : https://huggingface.co/datasets/nlphuji/flickr30k
	- COCO 2017: https://cocodataset.org

	You can get that checkpoint using the 3083a3cef6e3c8dd90df3f088074bbe836b0f403 commit.

	It was then further fine-tuned on :

	- Flickr30k debiased: https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions
	- DocOrNot: https://huggingface.co/datasets/Mozilla/docornot

	You can find the code used to create the model here: https://github.com/mozilla/distilvit

	### Framework versions

	- Transformers 4.40.2
	- Pytorch 2.3.0+cu121
	- Datasets 2.19.1
	- Tokenizers 0.19.1