vinid committed on
Commit
f882247
1 Parent(s): dbb7dfd

fixing a few things

Files changed (1)
  1. introduction.md +54 -34
introduction.md CHANGED
@@ -1,14 +1,17 @@
1
  # Italian CLIP
2
 
3
- With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model
4
- is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI
 
 
 
5
  [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
6
 
7
  In building this project we kept in mind the following principles:
8
 
9
- + **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
10
- + **Scientific Validity**: Claim are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models in several tasks and made the validation reproducible for everybody.
11
- + **Broader Outlook**: We always kept in mind which are the possible usages for this model.
12
 
13
  We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were
14
able to make new friends and learn a lot from each other while working towards a common goal!
@@ -25,7 +28,7 @@ have the highest similarity with the text query.
25
+ *Image to Text*: This task is essentially a zero-shot image classification task. The user provides an image and a set of captions/labels, and CLIP
26
computes the similarity between the image and each label. The webapp then displays a probability distribution over the captions.
27
 
28
- + *Examples and Applications*: This page showcases some interesting results we got from the model, we believe that there are
29
  different applications that can start from here.
30
 
31
  # Novel Contributions
@@ -50,8 +53,8 @@ Thus, we tried to add as much data as possible while keeping the data-quality as
50
  We considered four main sources of data:
51
 
52
  + [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see,
53
- [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper as they are
54
- the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
55
However, this kind of text, without more information, is not useful for learning a good mapping between images and captions.
56
  On the other hand, this text is written in Italian and it is of good quality. We cannot just remove short captions as some of those
57
  are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
@@ -60,7 +63,7 @@ However, this kind of text, without more information, is not useful to learn a g
60
 
61
Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.
62
 
63
- + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions comes from the original
64
  MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
65
100K images; for each image, more than one caption is available.
66
 
@@ -126,35 +129,43 @@ didn't go well. Eventually, the thing that worked out the best was fixing the lo
126
  is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)).
127
  We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
128
 
129
- ### Effect
130
 
131
- The following picture showcase the effect that these edits have had on our loss:
132
 
133
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>
134
 
135
- The purple line is the original training, you can see how many steps we needed to get the loss down. Yellow line is the
136
- loss with the new optimizer, it is **striking** to see the time we save from this addition! Blue line shows the results when
 
137
  fixed scaling is added with the new optimization. Finally, we added the backbone freezing part and you can see the
138
- results in the light blue loss.
 
139
 
140
 
141
  # Scientific Validity
142
 
 
 
 
 
143
  ## Quantitative Evaluation
144
Cool and interesting examples are great, but a model is nothing without validation.
145
- To better understand how well our clip-italian model works we run an experimental evaluation. Since this is the first clip-based model in Italian, we used the multilingual CLIP model as a comparison baseline.
146
 
147
  ### mCLIP
148
 
149
The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his
150
[sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder
151
- that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
 
152
 
153
  ### Tasks
154
 
155
  We selected two different tasks:
156
- + image-retrieval
157
- + zero-shot classification
 
158
 
159
### Reproducibility
160
 
@@ -167,8 +178,8 @@ Both experiments should be very easy to replicate, we share the two colab notebo
167
  ### Image Retrieval
168
 
169
This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given as input
170
- a caption, we search for the most similar image in the MSCOCO-IT validation set. As evaluation metrics
171
- we use the MRR@K.
172
 
173
  | MRR | CLIP-Italian | mCLIP |
174
  | --------------- | ------------ |-------|
@@ -176,16 +187,14 @@ we use the MRR@K.
176
  | MRR@5 | **0.5039** | 0.3957|
177
  | MRR@10 | **0.5204** | 0.4129|
178
 
179
- It is true that we used MSCOCO-IT in training, and this might give us an advantage. However the original CLIP model was trained
180
on 400 million images (and some of them were probably from MSCOCO).
181
 
182
-
183
  ### Zero-shot image classification
184
 
185
  This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet.
186
To do this, we used DeepL to translate the ImageNet image labels into Italian. We evaluate the models by computing accuracy at different values of K.
187
 
188
-
189
  | Accuracy | CLIP-Italian | mCLIP |
190
  | --------------- | ------------ |-------|
191
  | Accuracy@1 | **22.11** | 20.15 |
@@ -193,17 +202,23 @@ To do this, we used DeepL to translate the image labels in ImageNet. We evaluate
193
  | Accuracy@10 | **52.55** | 42.91 |
194
  | Accuracy@100 | **81.08** | 67.11 |
195
 
 
 
196
Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
197
  we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
198
- paper (see, [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). However, considering that our results are in line with those obtained by mCLIP we think that
199
- the translated image labels might have had an impact on the final scores.
200
-
201
-
202
 
203
  ## Qualitative Evaluation
204
 
205
Here we show some very interesting properties of the model. One is its ability to detect colors,
206
- then there is its (partial) counting ability and finally the ability of understanding more complex quries. To our own surprise, many of the answers the model gives make a lot of sense!
 
 
 
 
 
207
  Look at the following - slightly cherry picked (but not even that much) - examples:
208
 
209
  ### Colors
@@ -241,14 +256,20 @@ early 1900 and it is part of the largest movie studios in Europe (Cinecittà).
241
  # Limitations and Bias
242
 
243
  Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model
244
- finds difficult to count after three; this is a general limitation.
245
- There are even more serious limitations: we found some emergence of biases and stereotypes that got in our model from different factors: searching for "una troia" ("a bitch") on the
246
- CC dataset shows the picture of a woman. The model's capability even increase this issue, as searching for "due troie" ("two bitches")
 
247
gives again, as a result, the picture of two women. BERT models are not free from bias. Indeed, different BERT models - Italian ones included - are prone to generate stereotyped sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).
248
-
249
- This issue is common to many machine learning algorithms (check [Abit et al., 2021](https://arxiv.org/abs/2101.05783) for bias in GPT-3 as an example) and
250
suggests that we need to work even harder on this problem, which affects our **society**.
251
 
 
 
 
 
 
252
  # References
253
 
254
  Abid, A., Farooqi, M., & Zou, J. (2021). [Persistent anti-muslim bias in large language models.](https://arxiv.org/abs/2101.05783) arXiv preprint arXiv:2101.05783.
@@ -267,6 +288,5 @@ Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). [Conceptual capti
267
 
268
  Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning](https://arxiv.org/pdf/2103.01913.pdf). arXiv preprint arXiv:2103.01913.
269
 
270
-
271
  # Other Notes
272
  This readme has been designed using resources from Flaticon.com
 
1
  # Italian CLIP
2
 
3
+ CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)) is an amazing model that can learn to represent images and text jointly in the same space.
4
+
5
+ In this project, we aim to propose the first CLIP model trained on Italian data. Italian can, in this context, be considered a
6
+ low-resource language. Using a few smart techniques, we were able to fine-tune a SOTA Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model
7
+ is built upon the pre-trained [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI
8
  [vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
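
If you want a concrete picture of this dual-encoder setup, the sketch below wires the same two towers together with the `VisionTextDualEncoderModel` API from `transformers`. This is only an illustration of the architecture, not the project's own Flax training code (linked later in this page), and the API shown here is our own assumption for the example.

```python
from transformers import VisionTextDualEncoderModel

# Illustration only: pair the OpenAI CLIP ViT-B/32 image encoder with the
# dbmdz Italian BERT text encoder behind two learned projection heads.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32",       # vision tower
    "dbmdz/bert-base-italian-xxl-cased",  # text tower
)
print(f"hybrid model with {sum(p.numel() for p in model.parameters()):,} parameters")
```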
9
 
10
  In building this project we kept in mind the following principles:
11
 
12
+ + **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs (**which we will share with the community**) and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
13
+ + **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on two tasks and made the validation reproducible for everybody.
14
+ + **Broader Outlook**: We always kept in mind the possible usages and limitations of this model.
15
 
16
  We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were
17
able to make new friends and learn a lot from each other while working towards a common goal!
 
28
+ *Image to Text*: This task is essentially a zero-shot image classification task. The user provides an image and a set of captions/labels, and CLIP
29
computes the similarity between the image and each label. The webapp then displays a probability distribution over the captions (a minimal sketch of this computation is shown right after this list).
30
 
31
+ + *Examples & Applications*: This page showcases some interesting results we got from the model; we believe that there are
32
  different applications that can start from here.
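
To make the *Image to Text* task concrete, here is a minimal NumPy sketch of what the webapp computes: cosine similarities between the image embedding and each caption embedding, turned into a probability distribution with a softmax. The embeddings and captions below are made up for illustration; in the demo they come from the model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

captions = ["un gatto", "un cane", "una macchina"]

# Toy stand-ins for the CLIP embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
caption_embs = rng.normal(size=(len(captions), 512))

# Cosine similarity = dot product of L2-normalised vectors.
image_emb /= np.linalg.norm(image_emb)
caption_embs /= np.linalg.norm(caption_embs, axis=1, keepdims=True)
sims = caption_embs @ image_emb

for caption, p in zip(captions, softmax(100.0 * sims)):  # CLIP-style scaling
    print(f"{caption}: {p:.3f}")
```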
33
 
34
  # Novel Contributions
 
53
  We considered four main sources of data:
54
 
55
  + [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see,
56
+ [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions
57
+ described in the paper as they are the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
58
However, this kind of text, without more information, is not useful for learning a good mapping between images and captions.
59
  On the other hand, this text is written in Italian and it is of good quality. We cannot just remove short captions as some of those
60
  are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
 
63
 
64
Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed; a short sketch of this filtering step is shown below.
65
 
66
+ + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
67
  MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
68
100K images; for each image, more than one caption is available.
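
As a rough illustration of the POS-tagging filter mentioned in the WIT item above, the sketch below keeps a caption only if it contains at least one noun, verb or adjective, so that pure named-entity strings are dropped. The spaCy Italian model and the exact rule are assumptions made for this example; the filter used to build the dataset may differ in its details.

```python
import spacy

# Requires the small Italian pipeline: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

def is_informative(caption: str) -> bool:
    """Keep captions that contain at least one noun, verb or adjective."""
    return any(tok.pos_ in {"NOUN", "VERB", "ADJ"} for tok in nlp(caption))

for caption in ["Anna Maria Mozzoni", "Un cane che corre sulla spiaggia"]:
    print(caption, "->", "keep" if is_informative(caption) else "drop")
```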
69
 
 
129
  is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)).
130
  We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
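
To make the fixed-scaling trick concrete, here is a small NumPy sketch of the symmetric CLIP contrastive loss with a constant (non-trainable) logit scale applied right after the image-text similarities. The value 20.0 is only an assumption for illustration; the constant actually used is in the linked `modeling_hybrid_clip.py`.

```python
import numpy as np
from scipy.special import logsumexp

def clip_loss_fixed_scale(image_embs, text_embs, logit_scale=20.0):
    """Symmetric CLIP loss where the logit scale is a fixed constant."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * image_embs @ text_embs.T      # scaled pairwise similarities

    diag = np.arange(len(logits))                        # i-th image matches i-th text
    log_p_img2txt = logits - logsumexp(logits, axis=1, keepdims=True)
    log_p_txt2img = logits - logsumexp(logits, axis=0, keepdims=True)
    return -0.5 * (log_p_img2txt[diag, diag].mean() + log_p_txt2img[diag, diag].mean())

rng = np.random.default_rng(0)
print(clip_loss_fixed_scale(rng.normal(size=(8, 256)), rng.normal(size=(8, 256))))
```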
131
 
132
+ ### Effect of Our Edits
133
 
134
+ The following picture showcases the effect that these edits had on our evaluation loss:
135
 
136
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>
137
 
138
+ The purple line is the original training without any of our improvements; you can see how many steps we needed to get the loss down.
139
+ The yellow line is the loss with the new optimizer; it is **striking** to see how much time we save with this addition! Not only does the loss improve, it
140
+ also converges much faster! The blue line shows the results when
141
  fixed scaling is added with the new optimization. Finally, we added the backbone freezing part and you can see the
142
+ results in the light blue loss. Nonetheless, as is common in deep learning, having more data played a big role and was another key element
143
+ in reducing the loss.
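
For readers curious about the backbone-freezing step, the snippet below shows one generic way to express it in a Flax/optax setup: the two backbones get a zero update while the projection heads are trained. This is a sketch under our own assumptions (toy parameter tree, AdamW as a placeholder optimizer), not the project's actual training script.

```python
import jax
import jax.numpy as jnp
import optax

# Toy parameter tree standing in for the two backbones and the projection heads.
params = {
    "text_model":        {"w": jnp.ones(3)},
    "vision_model":      {"w": jnp.ones(3)},
    "text_projection":   {"w": jnp.ones(3)},
    "visual_projection": {"w": jnp.ones(3)},
}

# Label every leaf: backbones are "frozen", projection heads are "trainable".
labels = {
    "text_model":        {"w": "frozen"},
    "vision_model":      {"w": "frozen"},
    "text_projection":   {"w": "trainable"},
    "visual_projection": {"w": "trainable"},
}

optimizer = optax.multi_transform(
    {"trainable": optax.adamw(1e-4), "frozen": optax.set_to_zero()},
    labels,
)
opt_state = optimizer.init(params)

# Frozen subtrees receive zero updates, so only the projections move.
grads = jax.tree_util.tree_map(jnp.ones_like, params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```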
144
 
145
 
146
  # Scientific Validity
147
 
148
+ We split this section into two parts: we first provide a quantitative evaluation to ensure that what we are learning is good, and we then
149
+ show some qualitative examples of images found by the model. **All the code we have written** to run our experiments (in combination with
150
+ code made available by Nils Reimers and by the authors of the original CLIP) is available.
151
+
152
  ## Quantitative Evaluation
153
Cool and interesting examples are great, but a model is nothing without validation.
154
+ Since this is the first CLIP-based model in Italian, we decided to use the multilingual CLIP model as a comparison baseline.
155
 
156
  ### mCLIP
157
 
158
The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his
159
[sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder
160
+ that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)). It shows
161
+ great capabilities in representing multilingual text in the same space as the images.
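
For reference, this is roughly how the mCLIP baseline can be queried on Italian text through sentence-transformers. The model names and calls follow the sentence-transformers documentation and should be treated as an assumption, not as the exact code of our evaluation notebooks; `foto.jpg` is a placeholder image path.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Multilingual text encoder distilled from CLIP's text tower, plus CLIP's image tower.
text_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")
image_model = SentenceTransformer("clip-ViT-B-32")

img_emb = image_model.encode(Image.open("foto.jpg"))
txt_emb = text_model.encode(["un gatto sul divano", "una macchina rossa"])

print(util.cos_sim(img_emb, txt_emb))  # similarity of the image to each Italian caption
```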
162
 
163
  ### Tasks
164
 
165
  We selected two different tasks:
166
+ + image retrieval, in which, given a caption, the model finds the most similar image;
167
+ + zero-shot classification, in which, given an image and a set of captions (or labels), the model finds
168
+ the best matching caption for the image.
169
 
170
### Reproducibility
171
 
 
178
  ### Image Retrieval
179
 
180
This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given as input
181
+ a caption from the dataset, we search for the most similar image in the MSCOCO-IT validation set and check whether it is the one that was
182
+ described by the original caption. As evaluation metric, we use MRR@K.
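
For completeness, here is MRR@K sketched in plain Python: a caption scores 1/rank if its image appears among the top K retrieved images, and 0 otherwise (the full evaluation lives in the shared Colab notebooks).

```python
import numpy as np

def mrr_at_k(ranked_image_ids, true_image_ids, k):
    """ranked_image_ids[i]: retrieved image ids for caption i, best first.
    true_image_ids[i]: the id of the image the caption actually describes."""
    reciprocal_ranks = []
    for retrieved, true_id in zip(ranked_image_ids, true_image_ids):
        top_k = list(retrieved[:k])
        rank = top_k.index(true_id) + 1 if true_id in top_k else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return float(np.mean(reciprocal_ranks))

# Toy example: the correct image is ranked 1st for the first caption, 3rd for the second.
print(mrr_at_k([[7, 3, 5], [9, 2, 4]], [7, 4], k=3))  # (1/1 + 1/3) / 2 ≈ 0.667
```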
183
 
184
  | MRR | CLIP-Italian | mCLIP |
185
  | --------------- | ------------ |-------|
 
187
  | MRR@5 | **0.5039** | 0.3957|
188
  | MRR@10 | **0.5204** | 0.4129|
189
 
190
+ It is true that we used MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
191
on 400 million images (and some of them were probably from MSCOCO).
192
 
 
193
  ### Zero-shot image classification
194
 
195
  This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet.
196
To do this, we used DeepL to translate the ImageNet image labels into Italian. We evaluate the models by computing accuracy at different values of K.
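
Accuracy@K here means that the correct translated label must appear among the K labels most similar to the image; a minimal sketch with dummy similarity scores:

```python
import numpy as np

def accuracy_at_k(similarities, true_labels, k):
    """similarities: (n_images, n_classes) image-label similarity scores.
    true_labels: index of the correct class for each image."""
    top_k = np.argsort(-similarities, axis=1)[:, :k]   # K best classes per image
    hits = [true_labels[i] in top_k[i] for i in range(len(true_labels))]
    return 100.0 * np.mean(hits)

# Toy example with 2 images and 4 classes.
sims = np.array([[0.1, 0.9, 0.3, 0.2],
                 [0.4, 0.2, 0.1, 0.8]])
print(accuracy_at_k(sims, np.array([1, 0]), k=1))   # 50.0
print(accuracy_at_k(sims, np.array([1, 0]), k=2))   # 100.0
```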
197
 
 
198
  | Accuracy | CLIP-Italian | mCLIP |
199
  | --------------- | ------------ |-------|
200
  | Accuracy@1 | **22.11** | 20.15 |
 
202
  | Accuracy@10 | **52.55** | 42.91 |
203
  | Accuracy@100 | **81.08** | 67.11 |
204
 
205
+ ### Discussion
206
+
207
Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
208
  we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
209
+ paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), which was evaluated on English data.
210
+ However, considering that our results are in line with those obtained by mCLIP, we think that the translated image
211
+ labels might have had an impact on the final scores.
 
212
 
213
  ## Qualitative Evaluation
214
 
215
Here we show some very interesting properties of the model. One is its ability to detect colors,
216
+ then there is its (partial) counting ability, and finally its ability to understand more complex queries. You can find
217
+ more examples in the "*Examples & Applications*" section of this demo.
218
+
219
+ To our own surprise, many of the answers the model gives make a lot of sense! Note that the model, in this case,
220
+ is searching for the right image in a set of 25K images from an Unsplash dataset.
221
+
222
  Look at the following - slightly cherry picked (but not even that much) - examples:
223
 
224
  ### Colors
 
256
  # Limitations and Bias
257
 
258
  Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model
259
+ finds it difficult to count beyond three; this is a general limitation that is common to many models of this kind.
260
+
261
+ There are even more evident issues: we found that biases and stereotypes crept into our model from different sources:
262
+ searching for "una troia" ("a bitch") on the CC dataset shows the picture of a woman. The model's capabilities even amplify this issue, as searching for "due troie" ("two bitches")
263
gives again, as a result, the picture of two women. BERT models are not free from bias. Indeed, different BERT models - Italian ones included - are prone to generate stereotyped sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).
264
+
265
+ Unfortunately, this kind of issue is common to many machine learning algorithms (see [Abid et al., 2021](https://arxiv.org/abs/2101.05783) for an example of bias in GPT-3) and
266
suggests that we need to work even harder on this problem, which affects our **society**.
267
 
268
+ # Useful Links
269
+
270
+ + [GitHub Repository](https://github.com/clip-italian/clip-italian)
271
+ + [Model on HuggingFace](https://huggingface.co/clip-italian/clip-italian)
272
+
273
  # References
274
 
275
  Abid, A., Farooqi, M., & Zou, J. (2021). [Persistent anti-muslim bias in large language models.](https://arxiv.org/abs/2101.05783) arXiv preprint arXiv:2101.05783.
 
288
 
289
  Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning](https://arxiv.org/pdf/2103.01913.pdf). arXiv preprint arXiv:2103.01913.
290
 
 
291
  # Other Notes
292
  This readme has been designed using resources from Flaticon.com