---
tags:
- image-to-text
- image-captioning
license: apache-2.0
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model: 
- distilbert/distilgpt2
- google/vit-base-patch16-224-in21k
---

This model is a variation of https://huggingface.co/nlpconnect/vit-gpt2-image-captioning, pairing a google/vit-base-patch16-224-in21k image encoder with a distilbert/distilgpt2 text decoder.

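As a rough illustration of that encoder-decoder pairing, the snippet below assembles a fresh model from the two base checkpoints listed in the metadata above. This is a minimal sketch of the architecture, not necessarily identical to the setup used in the training repository:

```python
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    AutoTokenizer,
)

# Assemble a ViT encoder + distilgpt2 decoder, matching the base_model entries above.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "distilbert/distilgpt2"
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

# GPT-2 has no pad token, so reuse EOS for padding and wire up the generation ids.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```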
- Read the blog post here: https://ziade.org/2024/03/17/distilvit-image-captioning-model
- The training code is here: https://github.com/tarekziade/distilvit

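For plain inference, the standard `transformers` image-to-text pipeline should work with this checkpoint. The model ID below is a placeholder for this repository's Hub ID, and the image URL is one of the widget examples above:

```python
from transformers import pipeline

# Replace the placeholder with this repository's actual Hub ID.
captioner = pipeline("image-to-text", model="your-username/distilvit")

# A local path, PIL image, or URL all work as input.
result = captioner(
    "https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg"
)
print(result[0]["generated_text"])
```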
Results after 3 epochs (and ~45 hours of training):

- eval_loss: 0.19939416646957397
- eval_rouge1: 43.006
- eval_rouge2: 16.9939
- eval_rougeL: 38.8923
- eval_rougeLsum: 38.8877
- eval_gen_len: 11.327256736227712
- eval_runtime: 1816.5255
- eval_samples_per_second: 13.77
- eval_steps_per_second: 1.721
- train_runtime: 46263.3695
- train_samples_per_second: 38.373
- train_steps_per_second: 4.797
- train_loss: 0.05974134062104816