tarekziade committed on
Commit ad7371e · verified · 1 Parent(s): 2963c42

Update README.md

Files changed (1):
  1. README.md +63 -78
README.md CHANGED
@@ -1,78 +1,63 @@
- ---
- tags:
- - image-to-text
- - image-captioning
- license: apache-2.0
- metrics:
- - rouge
- datasets:
- - Mozilla/flickr30k-transformed-captions
- widget:
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
-   example_title: Savanna
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
-   example_title: Football Match
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
-   example_title: Airport
- base_model:
- - google/vit-base-patch16-224-in21k
-
- model-index:
- - name: mozilla/distilvit
-   results:
-   - task:
-       type: image-to-text
-       name: Image To Text
-     dataset:
-       name: Mozilla/flickr30k-transformed-captions
-       type: Mozilla/flickr30k-transformed-captions
-     metrics:
-     - name: ROUGE-1
-       type: rouge
-       value: 43.006
-       verified: true
-     - name: ROUGE-2
-       type: rouge
-       value: 16.9939
-       verified: true
-     - name: ROUGE-L
-       type: rouge
-       value: 38.8923
-       verified: true
-     - name: ROUGE-LSUM
-       type: rouge
-       value: 38.8877
-       verified: true
-     - name: loss
-       type: loss
-       value: 0.19939416646957397
-     - name: gen_len
-       type: gen_len
-       value: 11.327256736227712
-       verified: true
- ---
-
- # distilvit
-
- This model is a work in progress. Fine-tuned version of those base models:
-
- - a VIT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- - a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
-
- This model was trained on:
-
- - Flickr30k : https://huggingface.co/datasets/nlphuji/flickr30k
- - COCO 2017: https://cocodataset.org
-
- You can get that checkpoint using the 3083a3cef6e3c8dd90df3f088074bbe836b0f403 commit.
-
- It was then further fine-tuned on :
-
- - [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions)
- - [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
- - [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
-
- For the latter, the dataset was annotated by our team to correct the alt text generated by the model,
- using the [checkvite tool](https://github.com/mozila/checkvite).
-
- You can find the code used to create the model here: https://github.com/mozilla/distilvit
 
+ ---
+ tags:
+ - image-to-text
+ - image-captioning
+ license: apache-2.0
+ metrics:
+ - rouge
+ datasets:
+ - Mozilla/flickr30k-transformed-captions-gpt4o
+ widget:
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
+   example_title: Savanna
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
+   example_title: Football Match
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
+   example_title: Airport
+ base_model:
+ - google/vit-base-patch16-224-in21k
+ ---
+
+ # distilvit
+
+ This model is a work in progress. It is a fine-tuned version of these base models:
+
+ - a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
+ - a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
+
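+ You can try the checkpoint with the `transformers` image-to-text pipeline. The snippet below is
+ a minimal usage sketch rather than an official recipe; it assumes `Mozilla/distilvit` is this
+ repository's model id and reuses one of the widget images above:
+
+ ```python
+ # Minimal usage sketch (assumes "Mozilla/distilvit" is this repository's model id).
+ from transformers import pipeline
+
+ # The checkpoint pairs a ViT image encoder with a distilled GPT-2 text decoder,
+ # so it is exposed through the image-to-text pipeline.
+ captioner = pipeline("image-to-text", model="Mozilla/distilvit")
+
+ image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg"
+ print(captioner(image_url)[0]["generated_text"])
+ ```
+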
+ This model was trained on:
+
+ - [A debiased version of COCO 2017](https://huggingface.co/datasets/Mozilla/coco-gpt4o)
+ - [A debiased version of Flickr30k](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o)
+ - [Images from Pexels](https://huggingface.co/datasets/Mozilla/pexels-gpt4o)
+ - [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
+ - [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
+
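+ To inspect any of these fine-tuning datasets, you can stream them with the `datasets` library.
+ This is a small sketch under assumptions: a `train` split exists, and the column names vary per
+ dataset, so check each dataset card for the exact schema:
+
+ ```python
+ # Sketch: peek at one of the fine-tuning datasets without downloading it fully.
+ # Assumption: the dataset exposes a "train" split; column names differ per dataset.
+ from datasets import load_dataset
+
+ ds = load_dataset("Mozilla/flickr30k-transformed-captions-gpt4o", split="train", streaming=True)
+ first_row = next(iter(ds))
+ print(list(first_row.keys()))
+ ```
+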
+ You can find the code used to create the model here: https://github.com/mozilla/distilvit
+
+ # training results
+
+ - eval/gen_len 14.99729
+ - eval/loss 0.17093
+ - eval/meteor 0.51479
+ - eval/rouge1 57.8066
+ - eval/rouge2 35.0888
+ - eval/rougeL 52.9138
+ - eval/rougeLsum 52.9101
+ - eval/runtime 760.2135
+ - eval/samples_per_second 11.18
+ - eval/steps_per_second 0.112
+ - train/epoch 8.0
+ - train/global_step 11752
+ - train/learning_rate 0.0
+ - train/loss 0.1034
+ - train/total_flos 1.518634875573869e+20
+ - train/train_loss 0.14875
+ - train/train_runtime 91405.9053
+ - train/train_samples_per_second 12.855
+ - train/train_steps_per_second 0.129
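+
+ For reference, caption metrics like the ROUGE and METEOR scores above can be computed with the
+ `evaluate` library. The sketch below is illustrative only and is not the exact evaluation code
+ from the distilvit training repo; note that the rouge values above appear to be reported as
+ percentages (scores scaled by 100):
+
+ ```python
+ # Sketch: compute caption metrics like the ones reported above.
+ # Illustrative only; not the training repo's exact evaluation code.
+ import evaluate
+
+ predictions = ["a dog runs across a grassy field"]   # model-generated captions
+ references = ["a dog is running through the grass"]  # ground-truth captions
+
+ rouge = evaluate.load("rouge")    # returns scores in [0, 1]; multiply by 100 to compare
+ meteor = evaluate.load("meteor")
+
+ print(rouge.compute(predictions=predictions, references=references))
+ print(meteor.compute(predictions=predictions, references=references))
+ ```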