tarekziade committed on
Commit ad7371e · verified · 1 Parent(s): 2963c42

Update README.md

Files changed (1):
  1. README.md +63 -78
README.md CHANGED
@@ -1,78 +1,63 @@
- ---
- tags:
- - image-to-text
- - image-captioning
- license: apache-2.0
- metrics:
- - rouge
- datasets:
- - Mozilla/flickr30k-transformed-captions
- widget:
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
-   example_title: Savanna
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
-   example_title: Football Match
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
-   example_title: Airport
- base_model:
- - google/vit-base-patch16-224-in21k
-
- model-index:
- - name: mozilla/distilvit
-   results:
-   - task:
-       type: image-to-text
-       name: Image To Text
-     dataset:
-       name: Mozilla/flickr30k-transformed-captions
-       type: Mozilla/flickr30k-transformed-captions
-     metrics:
-     - name: ROUGE-1
-       type: rouge
-       value: 43.006
-       verified: true
-     - name: ROUGE-2
-       type: rouge
-       value: 16.9939
-       verified: true
-     - name: ROUGE-L
-       type: rouge
-       value: 38.8923
-       verified: true
-     - name: ROUGE-LSUM
-       type: rouge
-       value: 38.8877
-       verified: true
-     - name: loss
-       type: loss
-       value: 0.19939416646957397
-     - name: gen_len
-       type: gen_len
-       value: 11.327256736227712
-       verified: true
- ---
-
- # distilvit
-
- This model is a work in progress. Fine-tuned version of those base models:
-
- - a VIT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- - a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
-
- This model was trained on:
-
- - Flickr30k : https://huggingface.co/datasets/nlphuji/flickr30k
- - COCO 2017: https://cocodataset.org
-
- You can get that checkpoint using the 3083a3cef6e3c8dd90df3f088074bbe836b0f403 commit.
-
- It was then further fine-tuned on :
-
- - [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions)
- - [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
- - [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
-
- For the latter, the dataset was annotated by our team to correct the alt text generated by the model,
- using the [checkvite tool](https://github.com/mozila/checkvite).
-
- You can find the code used to create the model here: https://github.com/mozilla/distilvit
 
+ ---
+ tags:
+ - image-to-text
+ - image-captioning
+ license: apache-2.0
+ metrics:
+ - rouge
+ datasets:
+ - Mozilla/flickr30k-transformed-captions-gpt4o
+ widget:
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
+   example_title: Savanna
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
+   example_title: Football Match
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
+   example_title: Airport
+ base_model:
+ - google/vit-base-patch16-224-in21k
+ ---
+
+ # distilvit
+
+ This model is a work in progress. It is a fine-tuned version of these base models:
+
+ - a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
+ - a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
+
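+ You can try the checkpoint with the `transformers` image-to-text pipeline. The snippet below is
+ a minimal usage sketch rather than an official recipe; it assumes `Mozilla/distilvit` is this
+ repository's model id and reuses one of the widget images above:
+
+ ```python
+ # Minimal usage sketch (assumes "Mozilla/distilvit" is this repository's model id).
+ from transformers import pipeline
+
+ # The checkpoint pairs a ViT image encoder with a distilled GPT-2 text decoder,
+ # so it is exposed through the image-to-text pipeline.
+ captioner = pipeline("image-to-text", model="Mozilla/distilvit")
+
+ image_url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg"
+ print(captioner(image_url)[0]["generated_text"])
+ ```
+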
+ This model was trained on:
+
+ - [A debiased version of COCO 2017](https://huggingface.co/datasets/Mozilla/coco-gpt4o)
+ - [A debiased version of Flickr30k](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o)
+ - [Images from Pexels](https://huggingface.co/datasets/Mozilla/pexels-gpt4o)
+ - [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
+ - [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
+
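+ To inspect any of these fine-tuning datasets, you can stream them with the `datasets` library.
+ This is a small sketch under assumptions: a `train` split exists, and the column names vary per
+ dataset, so check each dataset card for the exact schema:
+
+ ```python
+ # Sketch: peek at one of the fine-tuning datasets without downloading it fully.
+ # Assumption: the dataset exposes a "train" split; column names differ per dataset.
+ from datasets import load_dataset
+
+ ds = load_dataset("Mozilla/flickr30k-transformed-captions-gpt4o", split="train", streaming=True)
+ first_row = next(iter(ds))
+ print(list(first_row.keys()))
+ ```
+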
+ You can find the code used to create the model here: https://github.com/mozilla/distilvit
+
+ # training results
+
+ - eval/gen_len 14.99729
+ - eval/loss 0.17093
+ - eval/meteor 0.51479
+ - eval/rouge1 57.8066
+ - eval/rouge2 35.0888
+ - eval/rougeL 52.9138
+ - eval/rougeLsum 52.9101
+ - eval/runtime 760.2135
+ - eval/samples_per_second 11.18
+ - eval/steps_per_second 0.112
+ - train/epoch 8.0
+ - train/global_step 11752
+ - train/learning_rate 0.0
+ - train/loss 0.1034
+ - train/total_flos 1.518634875573869e+20
+ - train/train_loss 0.14875
+ - train/train_runtime 91405.9053
+ - train/train_samples_per_second 12.855
+ - train/train_steps_per_second 0.129
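+
+ For reference, caption metrics like the ROUGE and METEOR scores above can be computed with the
+ `evaluate` library. The sketch below is illustrative only and is not the exact evaluation code
+ from the distilvit training repo; note that the rouge values above appear to be reported as
+ percentages (scores scaled by 100):
+
+ ```python
+ # Sketch: compute caption metrics like the ones reported above.
+ # Illustrative only; not the training repo's exact evaluation code.
+ import evaluate
+
+ predictions = ["a dog runs across a grassy field"]   # model-generated captions
+ references = ["a dog is running through the grass"]  # ground-truth captions
+
+ rouge = evaluate.load("rouge")    # returns scores in [0, 1]; multiply by 100 to compare
+ meteor = evaluate.load("meteor")
+
+ print(rouge.compute(predictions=predictions, references=references))
+ print(meteor.compute(predictions=predictions, references=references))
+ ```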