4rtemi5 committed on
Commit 7377c04
1 Parent(s): 3a77c4a

fixing typos

Files changed (1)
  1. introduction.md +12 -13
introduction.md CHANGED
@@ -71,7 +71,7 @@ MSCOCO dataset and have been translated with Microsoft Translator. The 2017 vers
71
 
72
  + [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
73
  the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). There are more than 3 million image-caption pairs in
74
- this dataset and these have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
75
  could not retrieve them all. We then translated the captions to Italian and were able to collect
76
  a dataset with 700K translated captions.
77
 
@@ -83,14 +83,14 @@ Each photo comes along with an Italian caption.
83
  ### A Note on Translations
84
 
85
  Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
86
- reason of this choice. With the few images (wrt OpenAI) that we have, we cannot risk polluting our own data. CC is a great resource
87
  but the captions have to be handled with care. We translated 700K captions and evaluated their quality:
88
 
89
  Three of us looked at a sample of 100 of the translations and rated them with scores from 1 to 4.
90
- The meaning of the value is as follows: 1, the sentence has lost is meaning or it's not possible to understand it; 2, it is possible to get the idea
91
- but there something wrong; 3, good, however a native speaker might complain about some translations; 4, good translation.
92
 
93
- The average score was of 3.78 and the three annotators had an inter-rater agreement - computed with [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal
94
  weighting - of 0.858 (great agreement!).
95
 
96
  | English Captions | Italian Captions |
@@ -99,7 +99,6 @@ weighting - of 0.858 (great agreement!).
99
  | person walking down the aisle | persona che cammina lungo la navata |
100
  | popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
101
 
102
- \t\t\t
103
  We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so
104
  that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
105
 
@@ -113,7 +112,7 @@ While we would have liked to have augmentations for the captions as well, after
113
 
114
  After different trials, we realized that the usual way of training this model was
115
  not sufficient to get good results. We thus modified three different parts of the
116
- training pipeline: the optimizer, the training with frozen components and the logit_scale parameter.
117
 
118
  ### Optimizer
119
 
@@ -124,9 +123,9 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
124
 
125
  ### Backbone Freezing
126
 
127
- The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required the least training.
128
  The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen.
129
- Only after these layers converged we unfreezed the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
130
 
131
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
132
 
@@ -146,14 +145,14 @@ The following picture showcases the effect that these edits have had on our eval
146
  The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
147
  The yellow line is the loss with the new optimizer: it is **striking** to see how much time we save with this addition! Not only does the loss improve, it
148
  also converges significantly faster! The blue line shows the results when
149
- fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy and you can see the
150
  results in the light blue curve. Nonetheless, as is common in deep learning, having more data played a big role and was another key element
151
  to reduce the loss.
152
 
153
 
154
  # Scientific Validity
155
 
156
- We split this section in two: we first provide a quantitative evaluation to ensure that what we are learning is really good.
157
  We then show some qualitative examples of images found by the model. **All the code we have written** to run our validation experiments (in combination with
158
  code made available by Nils Reimers and by the authors of the original CLIP) is available.
159
 
@@ -195,7 +194,7 @@ described by the original caption. As evaluation metrics we use the MRR@K.
195
  | MRR@5 | **0.5039** | 0.3957|
196
  | MRR@10 | **0.5204** | 0.4129|
197
 
198
- _If the table above doesn not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
199
 
200
  It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
201
  on 400 million images (and some of them might have been from MSCOCO).
@@ -238,7 +237,7 @@ Look at the following - slightly cherry picked - examples:
238
  Here's "a yellow flower"
239
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>
240
 
241
- And here's "a blu flower"
242
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>
243
 
244
  ### Counting
 
71
 
72
  + [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
73
  the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). There are more than 3 million image-caption pairs in
74
+ this dataset that have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
75
  could not retrieve them all. We then translated the captions to Italian and were able to collect
76
  a dataset with 700K translated captions.
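
As a rough illustration of this collection step (not the project's actual script; the TSV name, output layout, and limit below are assumptions), here is a sketch that reads Conceptual Captions' caption/URL pairs and simply skips URLs that can no longer be retrieved:

```python
import csv
import os
import requests

def download_conceptual_captions(tsv_path="cc_train.tsv", out_dir="cc_images", limit=1000):
    """Read caption/URL pairs and keep only the images that can still be fetched.

    The file name, output directory, and limit are illustrative placeholders.
    """
    os.makedirs(out_dir, exist_ok=True)
    kept = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f, delimiter="\t")):
            if i >= limit:
                break
            caption, url = row[0], row[1]
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # dead link: drop this caption-image pair
            path = os.path.join(out_dir, f"{i}.jpg")
            with open(path, "wb") as img_file:
                img_file.write(resp.content)
            kept.append((path, caption))  # captions still need translation to Italian
    return kept
```
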
77
 
 
83
  ### A Note on Translations
84
 
85
  Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
86
+ reason for this choice. With the few images we have (compared to OpenAI), we cannot risk polluting our own data. CC is a great resource,
87
  but the captions have to be handled with care. We translated 700K captions and evaluated their quality:
88
 
89
  Three of us looked at a sample of 100 of the translations and rated them with scores from 1 to 4.
90
+ The meaning of each value is as follows: 1, the sentence has lost its meaning or cannot be understood; 2, it is possible to get the idea
91
+ but there is something wrong; 3, good, though a native speaker might complain about the translation; 4, good translation.
92
 
93
+ The average score was 3.78, and the three annotators had an inter-rater agreement - computed with [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal
94
  weighting - of 0.858 (great agreement!).
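
For readers who want to reproduce this kind of agreement score, here is a minimal sketch of a weighted Gwet coefficient (the weighted variant of AC1, often called AC2), assuming three raters, no missing ratings, and linear distance weights as a stand-in for the ordinal weighting mentioned above; the scores fed in below are random placeholders, not our annotations:

```python
import numpy as np

def gwet_weighted_ac(ratings, q, weights=None):
    """Weighted Gwet agreement coefficient (AC1 with weights, a.k.a. AC2).

    ratings: (n_items, n_raters) array of categories coded 1..q, no missing values.
    weights: (q, q) agreement weights; defaults to linear distance weights,
             used here as a stand-in for ordinal weighting.
    """
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    if weights is None:
        idx = np.arange(q)
        weights = 1.0 - np.abs(idx[:, None] - idx[None, :]) / (q - 1)

    # r_ik: number of raters assigning item i to category k
    r_ik = np.stack([(ratings == c).sum(axis=1) for c in range(1, q + 1)], axis=1)

    # Weighted observed agreement p_a
    r_star = r_ik @ weights.T  # r*_ik = sum_l w_kl * r_il
    p_a = ((r_ik * (r_star - 1)).sum(axis=1) / (n_raters * (n_raters - 1))).mean()

    # Chance agreement p_e
    pi_k = r_ik.mean(axis=0) / n_raters
    p_e = weights.sum() / (q * (q - 1)) * (pi_k * (1 - pi_k)).sum()

    return (p_a - p_e) / (1 - p_e)

# Toy example: 3 annotators rating 100 translations on the 1-4 scale
rng = np.random.default_rng(0)
scores = rng.integers(3, 5, size=(100, 3))  # placeholder ratings, not our data
print(round(gwet_weighted_ac(scores, q=4), 3))
```
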
95
 
96
  | English Captions | Italian Captions |
 
99
  | person walking down the aisle | persona che cammina lungo la navata |
100
  | popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
101
 
 
102
  We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so
103
  that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
104
 
 
112
 
113
  After different trials, we realized that the usual way of training this model was
114
  not sufficient to get good results. We thus modified three different parts of the
115
+ training pipeline: the optimizer, the training with frozen components, and the fixed logit_scale parameter.
116
 
117
  ### Optimizer
118
 
 
123
 
124
  ### Backbone Freezing
125
 
126
+ The ViT used by OpenAI was already trained on 400 million images, and it is the element in our architecture that probably requires the least amount of training.
127
  The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen.
128
+ Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
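
A minimal sketch of this two-stage schedule, with toy stand-in modules rather than the real ViT/BERT backbones and projection layers (this is not the project's actual training code, and the learning rates are illustrative):

```python
import torch
from torch import nn

# Toy stand-ins for the image/text backbones and the randomly initialised
# re-projection layers; only the two-stage freezing schedule is illustrated.
image_backbone, text_backbone = nn.Linear(768, 768), nn.Linear(768, 768)
image_proj, text_proj = nn.Linear(768, 512), nn.Linear(768, 512)

def set_backbones_trainable(trainable: bool):
    for module in (image_backbone, text_backbone):
        for p in module.parameters():
            p.requires_grad_(trainable)

# Stage 1: backbones frozen, only the projection heads are updated.
set_backbones_trainable(False)
proj_params = list(image_proj.parameters()) + list(text_proj.parameters())
optimizer = torch.optim.AdamW(proj_params, lr=1e-4)
# ... train until the projection layers have converged ...

# Stage 2: unfreeze everything and fine-tune the whole model.
set_backbones_trainable(True)
all_params = proj_params + list(image_backbone.parameters()) + list(text_backbone.parameters())
optimizer = torch.optim.AdamW(all_params, lr=1e-5)
# ... continue training on the full model ...
```
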
129
 
130
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
131
 
 
145
  The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
146
  The yellow line is the loss with the new optimizer: it is **striking** to see how much time we save with this addition! Not only does the loss improve, it
147
  also converges significantly faster! The blue line shows the results when
148
+ fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy, and you can see the
149
  results in the light blue curve. Nonetheless, as is common in deep learning, having more data played a big role and was another key element
150
  to reduce the loss.
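
For context, the scaling in question is the logit_scale temperature that multiplies the image-text similarity matrix in CLIP-style contrastive training; here is a minimal sketch of the symmetric loss with a fixed, non-learnable scale (the value 20.0 is an assumption for illustration, not necessarily the value used in this project):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, logit_scale=20.0):
    """Symmetric contrastive loss with a fixed (non-learnable) logit scale.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```
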
151
 
152
 
153
  # Scientific Validity
154
 
155
+ We split this section into two parts: we first provide a quantitative evaluation to ensure that what we are learning is in fact good.
156
  We then show some qualitative examples of images found by the model. **All the code we have written** to run our validation experiments (in combination with
157
  code made available by Nils Reimers and by the authors of the original CLIP) is available.
158
 
 
194
  | MRR@5 | **0.5039** | 0.3957|
195
  | MRR@10 | **0.5204** | 0.4129|
196
 
197
+ _If the table above does not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
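
For reference, MRR@K averages the reciprocal rank of the correct image over all queries, counting 0 when it does not appear in the top K; a minimal sketch, assuming a 1:1 caption-image correspondence and using random scores as placeholder data:

```python
import numpy as np

def mrr_at_k(similarity, k):
    """similarity: (n_queries, n_images) scores; the correct image for
    query i is assumed to be image i (1:1 caption-image pairs)."""
    n = similarity.shape[0]
    ranking = np.argsort(-similarity, axis=1)  # best-scoring image first
    rr = []
    for i in range(n):
        hits = np.where(ranking[i, :k] == i)[0]
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))

# Placeholder usage with random scores
scores = np.random.default_rng(0).normal(size=(1000, 1000))
print(mrr_at_k(scores, k=10))
```
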
198
 
199
  It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
200
  on 400 million images (and some of them might have been from MSCOCO).
 
237
  Here's "a yellow flower"
238
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>
239
 
240
+ And here's "a blue flower"
241
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>
242
 
243
  ### Counting