tarekziade/distilvit

vision-encoder-decoder

image-captioning

	---
	tags:
	- image-to-text
	- image-captioning
	license: apache-2.0
	widget:
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
	  example_title: Savanna
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
	  example_title: Football Match
	- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
	  example_title: Airport
	base_model:
	- distilbert/distilgpt2
	- google/vit-base-patch16-224-in21k
	---

	This model is a variation of https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

	- Read the blog post here: https://ziade.org/2024/03/17/distilvit-image-captioning-model
	- The training code is here: https://github.com/tarekziade/distilvit
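
The card does not include a usage snippet, so here is a minimal sketch of how a model like this is typically called. It assumes the `transformers` library is installed and that the checkpoint loads through its `image-to-text` pipeline (suggested by the `image-to-text` tag above, but not confirmed by the author). The helper names `load_captioner`, `first_caption`, and `caption_image` are hypothetical and exist only for this illustration.

```python
from typing import Any, Callable, Dict, List

# Repository name of this model card.
MODEL_ID = "tarekziade/distilvit"


def load_captioner() -> Callable[..., List[Dict[str, Any]]]:
    """Build an image-to-text pipeline (assumption: the checkpoint supports it).

    transformers is imported lazily so the helpers below can be used
    and tested without downloading any weights.
    """
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_ID)


def first_caption(outputs: List[Dict[str, Any]]) -> str:
    """Extract the caption from the usual pipeline output,
    a list of dicts with a 'generated_text' key."""
    return outputs[0]["generated_text"].strip()


def caption_image(image: str, captioner: Callable[..., List[Dict[str, Any]]]) -> str:
    """Caption one image given as a local path or URL."""
    return first_caption(captioner(image))
```

With network access, `caption_image("https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg", load_captioner())` should return a short caption for the first widget image listed in the front matter; the exact text depends on the checkpoint and library version.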

	Results after 3 epochs (and ~45 hours of training):

	- eval_loss: 0.19939416646957397
	- eval_rouge1: 43.006
	- eval_rouge2: 16.9939
	- eval_rougeL: 38.8923
	- eval_rougeLsum: 38.8877
	- eval_gen_len: 11.327256736227712
	- eval_runtime: 1816.5255
	- eval_samples_per_second: 13.77
	- eval_steps_per_second: 1.721
	- train_runtime: 46263.3695
	- train_samples_per_second: 38.373
	- train_steps_per_second: 4.797
	- train_loss: 0.05974134062104816
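
The throughput figures above can be cross-checked against each other. The sketch below assumes they follow the common convention where `samples_per_second` is the sample count divided by the runtime and `steps_per_second` is the step count divided by the runtime. Under that assumption, the ratio of the two gives the number of samples per step, and runtime times samples per second gives the approximate evaluation set size. These derived values are inferences from the listed numbers, not figures reported by the author.

```python
# Values copied from the metrics listed above.
metrics = {
    "eval_runtime": 1816.5255,
    "eval_samples_per_second": 13.77,
    "eval_steps_per_second": 1.721,
    "train_samples_per_second": 38.373,
    "train_steps_per_second": 4.797,
}


def samples_per_step(samples_per_second: float, steps_per_second: float) -> float:
    """Samples processed per step, i.e. the effective batch size."""
    return samples_per_second / steps_per_second


train_batch = samples_per_step(
    metrics["train_samples_per_second"], metrics["train_steps_per_second"]
)
eval_batch = samples_per_step(
    metrics["eval_samples_per_second"], metrics["eval_steps_per_second"]
)

# Approximate number of evaluation samples (rounded inputs, so not exact).
eval_samples = metrics["eval_runtime"] * metrics["eval_samples_per_second"]

print(f"train samples per step: {train_batch:.2f}")
print(f"eval samples per step:  {eval_batch:.2f}")
print(f"approx. eval samples:   {eval_samples:.0f}")
```

Both ratios come out close to 8, which is consistent with an effective batch size of 8 for training and evaluation, and the evaluation set works out to roughly 25,000 samples.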