vinid committed on
Commit
f3c3055
1 Parent(s): 7c7aaac

update text

Files changed (1)
  1. introduction.md +11 -3
introduction.md CHANGED
@@ -86,6 +86,14 @@ but there something wrong; 3: good, however a native speaker might complain abou
  The average score was 3.8 and the two annotators had an inter-rater agreement - computed with [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal
  weighting - of 0.86 (great agreement!).
 
+ | English Captions | Italian Captions |
+ | ----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
+ | an endless cargo of tanks on a train pulled down tracks in an empty dry landscape | un carico infinito di carri armati su un treno trascinato lungo i binari in un paesaggio secco e vuoto |
+ | person walking down the aisle | persona che cammina lungo la navata |
+ | popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
+
+
+
  We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so
  that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
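
To make the inter-rater agreement reported above more concrete, here is a minimal, illustrative sketch of Gwet's AC1 for two raters. The post uses the ordinal-weighted variant of the statistic; the helper below is the simpler unweighted form, and the ratings in the example call are made up.

```python
import numpy as np

def gwet_ac1(rater1, rater2, categories):
    """Unweighted Gwet's AC1 for two raters scoring the same items."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    n, q = len(r1), len(categories)

    # Observed agreement: fraction of items on which the two raters coincide.
    pa = np.mean(r1 == r2)

    # Chance agreement, driven by the average propensity of each category.
    pi_k = np.array([((r1 == c).sum() + (r2 == c).sum()) / (2 * n) for c in categories])
    pe = (pi_k * (1 - pi_k)).sum() / (q - 1)

    return (pa - pe) / (1 - pe)

# Toy example on a 1-4 quality scale (hypothetical ratings, not the real annotations).
print(gwet_ac1([4, 3, 4, 2, 4], [4, 3, 3, 2, 4], categories=[1, 2, 3, 4]))
```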
 
@@ -110,7 +118,7 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
  The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required the least training.
  The same is true for the BERT model we use. To allow the randomly initialized Re-projection Layers to warm up without messing with the tuned weights of the backbones, we did a first training run with the backbones of our architecture completely frozen. Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
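
A rough sketch of this two-stage schedule follows; the module names and learning rates are hypothetical stand-ins (the actual implementation is linked above), and the snippet only illustrates freezing the two backbones while the re-projection layers warm up, then unfreezing everything for the final fine-tuning.

```python
import torch
from torch import nn

# Hypothetical stand-ins: in the real model these would be the pre-trained ViT
# image encoder, the Italian BERT text encoder, and the freshly initialized
# re-projection layers mapping both towers into the shared embedding space.
class CLIPLikeModel(nn.Module):
    def __init__(self, hidden=768, shared=512):
        super().__init__()
        self.vision_backbone = nn.Linear(hidden, hidden)
        self.text_backbone = nn.Linear(hidden, hidden)
        self.visual_projection = nn.Linear(hidden, shared)
        self.text_projection = nn.Linear(hidden, shared)

def set_backbones_trainable(model, trainable):
    for module in (model.vision_backbone, model.text_backbone):
        for param in module.parameters():
            param.requires_grad = trainable

model = CLIPLikeModel()

# Stage 1: freeze both backbones and warm up only the projection layers.
set_backbones_trainable(model, False)
stage1_optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
# ... train until the projection layers converge ...

# Stage 2: unfreeze everything and fine-tune all the components.
set_backbones_trainable(model, True)
stage2_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... continue training the whole model ...
```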
 
- <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="90%"/>
+ <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
 
  ### Logit Scale
 
@@ -123,7 +131,7 @@ We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence em
 
  The following picture showcases the effect that these edits have had on our loss:
 
- <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="600"/>
+ <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>
 
  The purple line is the original training; you can see how many steps we needed to get the loss down. The yellow line is the
  loss with the new optimizer: it is **striking** to see the time we save from this addition! The blue line shows the results when
@@ -154,7 +162,7 @@ We selected two different tasks:
  Both experiments should be very easy to replicate; we share the two Colab notebooks we used to compute the two results:
 
  + [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
- + [ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
+ + [ImageNet Zero Shot Classification](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
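
As an illustration of what the zero-shot notebook does, here is a minimal, self-contained sketch of zero-shot classification with a CLIP-style dual encoder. The `encode_text` and `encode_image` functions are hypothetical placeholders for the fine-tuned Italian text tower and the image tower, and the prompt template is only an example; the scoring step (normalize both embeddings, rank classes by cosine similarity) is the standard CLIP recipe.

```python
import numpy as np

# Hypothetical placeholder encoders returning embeddings in the shared space.
# In practice these would run the fine-tuned BERT and ViT towers.
def encode_text(prompts, dim=512):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(prompts), dim))

def encode_image(image_path, dim=512):
    rng = np.random.default_rng(1)
    return rng.normal(size=dim)

# One Italian prompt per class (toy subset of labels, illustrative template).
class_names = ["gatto", "cane", "treno"]
prompts = [f"una foto di un {name}" for name in class_names]

# Embed and L2-normalize both sides, then rank classes by cosine similarity.
text_emb = encode_text(prompts)
text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)

image_emb = encode_image("example.jpg")
image_emb = image_emb / np.linalg.norm(image_emb)

scores = text_emb @ image_emb
print("predicted class:", class_names[int(np.argmax(scores))])
```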
 
 
  ### Image Retrieval
 