---
tags:
- image-to-text
- image-captioning
license: apache-2.0
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model:
- distilbert/distilgpt2
- google/vit-base-patch16-224-in21k
---

This model is a variation of https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

- Read the blog post here: https://ziade.org/2024/03/17/distilvit-image-captioning-model
- The training code is here: https://github.com/tarekziade/distilvit

Results after 3 epochs (and ~45 hours of training):

- eval_loss: 0.19939416646957397
- eval_rouge1: 43.006
- eval_rouge2: 16.9939
- eval_rougeL: 38.8923
- eval_rougeLsum: 38.8877
- eval_gen_len: 11.327256736227712
- eval_runtime: 1816.5255
- eval_samples_per_second: 13.77
- eval_steps_per_second: 1.721
- train_runtime: 46263.3695
- train_samples_per_second: 38.373
- train_steps_per_second: 4.797
- train_loss: 0.05974134062104816
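
The throughput figures in the results can be combined to estimate a few quantities that are not listed directly. The sketch below is only a rough reading of the numbers: it assumes the `*_samples_per_second` and `*_steps_per_second` values are averages over the whole reported runtime (in seconds), and the `derived` helper is a hypothetical name used for illustration, not part of the training code.

```python
# Throughput figures copied from the results list in this model card.
metrics = {
    "eval_runtime": 1816.5255,          # seconds
    "eval_samples_per_second": 13.77,
    "eval_steps_per_second": 1.721,
    "train_runtime": 46263.3695,        # seconds
    "train_samples_per_second": 38.373,
    "train_steps_per_second": 4.797,
}
EPOCHS = 3  # stated in the results heading


def derived(prefix: str) -> dict:
    """Estimate sample count, step count and samples per step.

    Assumption: the per-second rates are averages over the full runtime.
    """
    runtime = metrics[f"{prefix}_runtime"]
    samples_per_s = metrics[f"{prefix}_samples_per_second"]
    steps_per_s = metrics[f"{prefix}_steps_per_second"]
    return {
        "samples": runtime * samples_per_s,
        "steps": runtime * steps_per_s,
        "samples_per_step": samples_per_s / steps_per_s,
    }


eval_stats = derived("eval")
train_stats = derived("train")
# Training samples are counted across all epochs, so divide to get one epoch.
train_samples_per_epoch = train_stats["samples"] / EPOCHS

print(f"eval samples (approx.):        {eval_stats['samples']:.0f}")
print(f"eval samples per step:         {eval_stats['samples_per_step']:.2f}")
print(f"train samples per step:        {train_stats['samples_per_step']:.2f}")
print(f"train samples/epoch (approx.): {train_samples_per_epoch:.0f}")
```

Under these assumptions, both ratios come out very close to 8 samples per step, which is consistent with a batch size of 8 for training and evaluation, though the card does not state the batch size explicitly.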