update text
introduction.md CHANGED (+11 -3)
@@ -86,6 +86,14 @@ but there is something wrong; 3: good, however a native speaker might complain abou
 The average score was 3.8, and the two annotators had an inter-rater agreement - computed with [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal
 weighting - of 0.86 (great agreement!).
 
+| English Captions | Italian Captions |
+| ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
+| an endless cargo of tanks on a train pulled down tracks in an empty dry landscape | un carico infinito di carri armati su un treno trascinato lungo i binari in un paesaggio secco e vuoto |
+| person walking down the aisle | persona che cammina lungo la navata |
+| popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
+
+
+
 We are aware that we annotated our own data; in the spirit of fairness, we also share the annotations and the captions so
 that anyone interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
 
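As a side note on the agreement statistic mentioned above: the sketch below shows how Gwet's AC1 can be computed for two annotators. It uses the plain unweighted formulation on made-up ratings (the figure in the post uses the ordinal-weighted variant), so it only illustrates the mechanics, not our actual numbers.

```python
import numpy as np

# Made-up ratings on the 1-4 scale from two annotators (illustrative only).
rater_a = np.array([4, 3, 4, 2, 4, 3, 4, 4, 3, 4])
rater_b = np.array([4, 3, 3, 2, 4, 3, 4, 4, 4, 4])
categories = np.array([1, 2, 3, 4])

# Observed agreement: fraction of items the two annotators score identically.
p_a = np.mean(rater_a == rater_b)

# Chance agreement in Gwet's AC1, based on the average prevalence pi_k of
# each category across both annotators.
n = len(rater_a)
pi = np.array([(np.sum(rater_a == c) + np.sum(rater_b == c)) / (2 * n) for c in categories])
p_e = np.sum(pi * (1 - pi)) / (len(categories) - 1)

ac1 = (p_a - p_e) / (1 - p_e)
print(f"Gwet's AC1 = {ac1:.2f}")
```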
@@ -110,7 +118,7 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
 The ViT used by OpenAI was already trained on 400 million images, and it is probably the element of our architecture that required the least training.
 The same is true for the BERT model we use. To let the randomly initialized Re-projection Layers warm up without disturbing the tuned weights of the backbones, we ran a first training phase with the backbones of our architecture completely frozen. Only after these layers converged did we unfreeze the rest of the model and fine-tune all the components. This technique allowed us to reach a much better validation loss.
 
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
 
 ### Logit Scale
 
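The two-phase schedule described in this hunk (frozen backbones first, full fine-tuning afterwards) can be expressed compactly in optax. The snippet below is a minimal sketch with placeholder parameter names and optimizer settings, not the actual clip-italian training code.

```python
import jax.numpy as jnp
import optax

# Toy parameter tree standing in for the real model; names are placeholders.
params = {
    "vision_backbone": {"kernel": jnp.ones((4, 4))},
    "text_backbone": {"kernel": jnp.ones((4, 4))},
    "visual_projection": {"kernel": jnp.ones((4, 2))},
    "text_projection": {"kernel": jnp.ones((4, 2))},
}

# Phase 1: only the randomly initialized projection layers receive updates,
# while both pre-trained backbones stay frozen.
labels = {
    "vision_backbone": {"kernel": "frozen"},
    "text_backbone": {"kernel": "frozen"},
    "visual_projection": {"kernel": "trainable"},
    "text_projection": {"kernel": "trainable"},
}
warmup_tx = optax.multi_transform(
    {"trainable": optax.adamw(1e-4), "frozen": optax.set_to_zero()},
    labels,
)
warmup_state = warmup_tx.init(params)

# Phase 2, once the projections have converged: switch to a plain optimizer
# so that every component, backbones included, is fine-tuned.
full_tx = optax.adamw(1e-5)
full_state = full_tx.init(params)
```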
@@ -123,7 +131,7 @@ We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence em
 
 The following picture showcases the effect that these edits have had on our loss:
 
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>
 
 The purple line is the original training; you can see how many steps we needed to get the loss down. The yellow line is the
 loss with the new optimizer; it is **striking** how much time we save thanks to this addition! The blue line shows the results when
@@ -154,7 +162,7 @@ We selected two different tasks:
 Both experiments should be very easy to replicate; we share the two Colab notebooks we used to compute the two results:
 
 + [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
-+ [ImageNet Zero Shot
++ [ImageNet Zero Shot Classification](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
 
 
 ### Image Retrieval
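For readers who prefer to skim the idea before opening the notebooks: both evaluations boil down to cosine similarity between embeddings. The numpy sketch below uses random arrays as stand-ins for the outputs of the image and text encoders; it mirrors the logic, not the notebooks themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for the encoder outputs: one row per
# image, one row per Italian class prompt (e.g. "una foto di ...").
image_embeds = rng.normal(size=(8, 512))
class_embeds = rng.normal(size=(1000, 512))

# L2-normalise so that a dot product equals cosine similarity.
image_embeds /= np.linalg.norm(image_embeds, axis=1, keepdims=True)
class_embeds /= np.linalg.norm(class_embeds, axis=1, keepdims=True)

# Zero-shot classification: assign each image to its most similar class prompt.
logits = image_embeds @ class_embeds.T        # (num_images, num_classes)
predicted_class = logits.argmax(axis=1)

# Image retrieval: given one caption embedding, rank all images by similarity.
query_embed = class_embeds[0]                 # stand-in for an encoded caption
ranking = np.argsort(-(image_embeds @ query_embed))
print(predicted_class[:3], ranking[:3])
```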