4rtemi5 committed
Commit dc140b8
Parent: df31557

update introduction, small fixes, typos, bias discussion

Files changed (1): introduction.md (+52 -43)

introduction.md (updated)
# Italian CLIP

CLIP ([Radford et al., 2021](https://arxiv.org/abs/2103.00020)) is an amazing model that can learn to represent images and text jointly in the same space.

In this project, we aim to propose the first CLIP model trained on Italian data; in this context, Italian can be considered a
low-resource language. Using a few techniques, we have been able to fine-tune a SOTA Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model
is built upon the pre-trained [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI
[vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
 
[...]

+ **Broader Outlook**: We always kept in mind the possible uses and limitations of this model.

We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were
able to make new friends and learn a lot from each other to work towards a common goal!
Thank you for this amazing opportunity, we hope you will like the results! :heart:

# Demo

[...]

# Novel Contributions

The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
To get competitive results we followed three strategies:
1. more and better data;
2. better augmentations;
3. better training strategies.

For those interested, we have a :comet: [Comet](https://www.comet.ml/g8a9/clip-italian/reports/clip-italian-training-metrics) report
that shows a **subset** of the experiments we ran. Different hyper-parameters played a role in reducing the validation
loss: the optimizer we used gave us great performance and much faster convergence, more data and augmentations helped a lot with generalization, and working on the training procedure and on the loss gave us the final improvement that you can see in the results.

[...]

could not retrieve them all. Eventually, we had to translate the captions to Italian. We have been able to collect
a dataset with 700K translated captions.

+ [La Foto del Giorno](https://www.ilpost.it/foto-del-giorno/). This image-caption dataset is collected from [Il Post](https://www.ilpost.it/), a prominent Italian online newspaper.
  The collection contains almost 30K pairs: starting from early 2011, for each day, editors at Il Post pick several images picturing the most salient events in the world.
  Each photo comes along with an Italian caption.

### A Note on Translations

[...]

## Better Augmentations

We knew that without a good augmentation strategy we could never get results competitive with a model trained on 400 million images. Therefore, we implemented heavy augmentations to make the training more data-efficient.
They include random affine transformations and perspective changes, as well as occasional equalization and random changes to brightness, contrast, saturation and hue. However, we kept the hue augmentations limited so that the model could still learn color definitions (a sketch of such a pipeline is shown below).
While we would have liked to augment the captions as well, after some experimentation we settled on randomly sampling one of the five captions available in MSCOCO and leaving the rest of the captions unmodified.
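To give a concrete picture, here is a minimal sketch of such an image-augmentation pipeline written with torchvision; the transform list and the parameter values are illustrative assumptions rather than the exact configuration of our training script.

```python
# Illustrative sketch of a heavy augmentation pipeline in the spirit described
# above. The specific transforms and parameter values are assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.3),
    transforms.RandomEqualize(p=0.2),                 # occasional histogram equalization
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.02),  # hue kept small on purpose
    transforms.ToTensor(),
])
```
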
 
## Better Training

[...]

### Optimizer

While the initial code used AdamW as an optimizer, we soon noticed that it introduced some bad properties into the training: the model started to overfit relatively quickly, and the weight decay made this effect worse.
We eventually switched to an optimization strategy that had worked well for us in similar cases: AdaBelief with Adaptive Gradient Clipping (AGC) and a cosine annealing schedule.
Together with a slight tuning of the learning rate, this helped us reduce the validation loss by more than 25%.
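As a rough illustration, such an optimizer can be assembled with `optax` in a JAX/Flax training loop; the hyper-parameter values below are placeholders, not the exact configuration we used.

```python
# Illustrative sketch: AdaBelief + Adaptive Gradient Clipping (AGC) + cosine
# annealing, assembled with optax. All values here are placeholder assumptions.
import optax

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,       # start from zero during warmup
    peak_value=1e-4,      # peak learning rate
    warmup_steps=500,
    decay_steps=50_000,   # total steps over which the cosine decay runs
    end_value=1e-6,
)

optimizer = optax.chain(
    optax.adaptive_grad_clip(0.01),           # AGC: clip gradients relative to parameter norms
    optax.adabelief(learning_rate=schedule),  # AdaBelief update rule
)
```
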
Our implementation is available online [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/run_hybrid_clip.py#L667).

### Backbone Freezing

The ViT used by OpenAI was already trained on 400 million images, and it is the element of our architecture that probably required the least training.
The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training run with the backbones of our architecture completely frozen.
Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
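One way to implement this kind of freezing in a Flax/optax setup is to route the backbone parameters through a zero update while the projection layers keep the real optimizer. The sketch below makes an assumption about the parameter tree layout (top-level `text_model` and `vision_model` entries in a plain nested dict); it is not the exact code of our training script.

```python
# Illustrative sketch: freeze the text/vision backbones and train only the
# freshly initialized projection layers, using optax.multi_transform.
# The top-level parameter names ("text_model", "vision_model") are assumptions.
import optax
from flax import traverse_util

def label_params(params):
    flat = traverse_util.flatten_dict(params)  # assumes params is a nested dict
    labels = {
        path: "frozen" if path[0] in ("text_model", "vision_model") else "trainable"
        for path in flat
    }
    return traverse_util.unflatten_dict(labels)

tx = optax.multi_transform(
    {"trainable": optax.adabelief(1e-4),  # real updates for the projection layers
     "frozen": optax.set_to_zero()},      # no updates for the frozen backbones
    label_params,                          # called on the params pytree to get labels
)
```

After the projection layers have converged, the same model can then be fine-tuned end to end by switching back to a single optimizer for all parameters.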
 
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>

### Logit Scale

We tried to improve the loss function in different ways: for example, we tried something similar to a margin-based loss, but those experiments
did not yield the results we hoped for. Eventually, the thing that worked out best was fixing the `logit_scale` value to 20. This value
is used after the computation of the similarity between the images and the texts in CLIP (see the code [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/modeling_hybrid_clip.py#L64)).
We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence embeddings.
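In simplified form, the fixed scaling enters the contrastive logits roughly as in the sketch below (a minimal illustration, not the exact implementation in `modeling_hybrid_clip.py`).

```python
# Minimal sketch of CLIP-style logits with a fixed logit scale of 20.
import jax.numpy as jnp

def clip_logits(image_embeds, text_embeds, logit_scale=20.0):
    # L2-normalize so that the dot products are cosine similarities
    image_embeds = image_embeds / jnp.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / jnp.linalg.norm(text_embeds, axis=-1, keepdims=True)
    # Fixed scaling instead of a learned temperature parameter
    logits_per_text = jnp.matmul(text_embeds, image_embeds.T) * logit_scale
    return logits_per_text.T, logits_per_text  # (logits_per_image, logits_per_text)
```
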
 
 
[...]

<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>

The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
The yellow line is the loss with the new optimizer; it is **striking** to see the time we save with this addition! Not only does the loss improve, it
also converges significantly faster! The blue line shows the results when
fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy, and you can see the
results in the light blue loss curve. Nonetheless, as is common in deep learning, having more data played a big role and was another key element
in reducing the loss.

# Scientific Validity

We split this section in two: we first provide a quantitative evaluation to verify that what the model is learning is actually good.
We then show some qualitative examples of images found by the model. **All the code we have written** to run our validation experiments (in combination with
code made available by Nils Reimers and by the authors of the original CLIP) is available.

## Quantitative Evaluation

Showing great images is definitely cool and interesting, but a model is nothing without validation.
Since this is the first CLIP-based model in Italian, we decided to use the multilingual CLIP model (mCLIP) as a comparison baseline.

### mCLIP

[...]

### Tasks

We selected two different tasks:
+ image retrieval, in which, given a caption, the model finds the most semantically similar image;
+ zero-shot classification, in which, given an image and a set of captions (or labels), the model finds
the best matching caption for the image.

### Reproducibility

In order to make both experiments very easy to replicate, we share the Colab notebooks we used to compute the results.

+ [Image Retrieval](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
+ [ImageNet Zero Shot Classification](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)

### Image Retrieval

This experiment is run against the MSCOCO-IT validation set (which we did not use during training). Given an input caption from the dataset,
we search for the most similar image in the MSCOCO-IT validation set and check whether it is the one that was
described by the original caption. As the evaluation metric we use MRR@K.
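For reference, MRR@K can be computed along these lines, assuming a similarity matrix in which caption *i* describes image *i* (a simplified sketch, not the exact code of the Colab notebook).

```python
# Sketch: MRR@K for caption-to-image retrieval. `similarities` is an
# (n_captions, n_images) matrix and caption i is assumed to describe image i.
import numpy as np

def mrr_at_k(similarities: np.ndarray, k: int) -> float:
    reciprocal_ranks = []
    for i, sims in enumerate(similarities):
        top_k = np.argsort(-sims)[:k]          # indices of the k most similar images
        hits = np.where(top_k == i)[0]
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))
```
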
 
| MRR     | CLIP-Italian | mCLIP  |
| ------- | ------------ | ------ |
| MRR@5   | **0.5039**   | 0.3957 |
| MRR@10  | **0.5204**   | 0.4129 |

It is true that we used the MSCOCO-IT training set during training, and this might give us an advantage. However, the original CLIP model was trained
on 400 million images (and some of them might have been from MSCOCO).

### Zero-shot image classification

This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet.
To do this, we used DeepL to automatically translate the image labels of ImageNet. No manual engineering of the labels or prompts was done.
We evaluate the models by computing the accuracy at different levels.
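Conceptually, the zero-shot prediction boils down to ranking the translated label embeddings against each image embedding, as in this simplified sketch (prompt templates and the actual notebook code are omitted).

```python
# Sketch: zero-shot classification by cosine similarity between image
# embeddings and the embeddings of the translated ImageNet labels.
import numpy as np

def zero_shot_predict(image_embeds: np.ndarray, label_embeds: np.ndarray) -> np.ndarray:
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    label_embeds = label_embeds / np.linalg.norm(label_embeds, axis=-1, keepdims=True)
    return np.argmax(image_embeds @ label_embeds.T, axis=-1)  # predicted label index per image
```
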
 
| Accuracy | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|

[...]

Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), which was trained and evaluated on English data.
Considering that our results are in line with those obtained by mCLIP, we think that the translated image
labels most probably had an impact on the final scores.

## Qualitative Evaluation

We hereby show some interesting properties of the model. One is its ability to detect colors,
then there is its (partial) counting ability, and finally its ability to understand more complex queries. You can find
more examples in the "*Examples & Applications*" section of this demo.

To our own surprise, many of the answers the model gives make a lot of sense! Note that the model, in this case,
is searching for the right image in a set of 25K images from an Unsplash dataset.

Look at the following - slightly cherry-picked - examples:

### Colors

Here's "a yellow flower"
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>

And here's "a blue flower"
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>

### Counting

[...]

# Broader Outlook

We believe that this model can be useful for many different applications. From image classification
to clustering, a model like our Italian CLIP can be used to support researchers and practitioners in many different tasks.
Indeed, not only can it be useful in research, but also in industry. A very interesting use case is given by e-commerce platforms:
these websites often deal with one main source of text, the search queries, and with lots of product images. CLIP Italian
can be a killer app in this context, providing a way to jointly search images and text. Moreover, Italy has many different collections
of photos in digital format that are difficult to categorize efficiently.
For example, the [Istituto Luce Cinecittà](https://it.wikipedia.org/wiki/Istituto_Luce_Cinecitt%C3%A0) is an Italian government-owned entity that has collected photos of Italy since the
early 1900s and is part of the largest movie studios in Europe (Cinecittà). A semantic way of finding images in their catalog could be an amazing use case.

# Limitations and Bias

Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model
finds it difficult to count beyond three; this is a general limitation that is common to many models of this type.

There are even more evident issues that we found in our model. Due to the unfiltered nature of our training data, the model is exposed to many biases such as sexism, racism, stereotypes,
slurs and gore, which it might replicate without any awareness of their hurtful and harmful nature. Indeed, different BERT models - Italian ones included - are prone to create stereotyped
sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).
While this is not something we intended, it certainly is something that we share the blame for, since we were not able to avoid it.

Unfortunately, these kinds of issues are common to many machine learning algorithms (see [Abid et al., 2021](https://arxiv.org/abs/2101.05783) for an example of bias in GPT-3).
This suggests we need to find better approaches to counteract this problem that affects **our society**.

# Useful Links