g8a9 committed
Commit 8f903f5 (parent: 79372f7)

Minor changes

Files changed (1): introduction.md (+9, -9)
introduction.md CHANGED
@@ -35,14 +35,14 @@ different applications that can start from here.
 
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
 We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
- To get competitive results we followed three strategies:
+ To get competitive results, we followed three strategies:
 1. more and better data;
 2. better augmentations;
 3. better training strategies.
 
 For those interested, we have a :comet: [Comet](https://www.comet.ml/g8a9/clip-italian/reports/clip-italian-training-metrics) report
 that shows a **subset** of the experiments we ran. Different hyper-parameters played a role in reducing the validation
- loss. The optimizer we used gave us great performance and huge conversion speed, more data and augmentations helped a lot in generalizing,
+ loss. The optimizer we used gave us great performance and fast convergence, more data and augmentations helped a lot in generalizing,
 working on the training and on the loss gave us the final increase that you can see in the results.
 
  ## More and Better Data
@@ -103,9 +103,9 @@ that those interested can check the quality. The Google Sheet is [here](https://
 
 ## Better Augmentations
 
- We knew that without a good augmentation strategy we could never get competitive results to a model trained on 400 million images. Therefore we implemented heavy augmentations to make the training more data efficient.
+ We knew that without a good augmentation strategy we could never get competitive results to a model trained on 400 million images. Therefore, we implemented heavy augmentations to make the training more data efficient.
 They include random affine transformations and perspective changes, as well as occasional equalization and random changes to brightness, contrast, saturation and hue. We made sure to keep hue augmentations limited however, to still give the model the ability to learn color definitions.
- While we would have liked to have augmentations for the captions as well after some experimentation we settled with random sampling from the five captions available in MSCOCO and leaving the rest of the captions unmodified.
+ While we would have liked to have augmentations for the captions as well, after some experimentation we settled with random sampling from the five captions available in MSCOCO and leaving the rest of the captions unmodified.
 
 ## Better Training
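As a concrete illustration of the augmentation recipe described in the hunk above, here is a minimal sketch using torchvision, plus the random caption sampling mentioned for MSCOCO. The specific transforms, probabilities, and parameter ranges are assumptions for illustration, not values taken from this repository.

```python
import random

from torchvision import transforms

# Illustrative image augmentations: random affine and perspective warps, occasional
# equalization, and colour jitter with a deliberately small hue range so the model
# can still learn colour words. All parameter values are assumptions.
image_augmentations = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.3),
    transforms.RandomEqualize(p=0.2),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])


def sample_caption(captions):
    """Caption-side 'augmentation': pick one of the (up to five) MSCOCO captions at random,
    leaving the caption text itself unmodified."""
    return random.choice(captions)
```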
 
@@ -123,7 +123,7 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
 ### Backbone Freezing
 
 The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required the least training.
- The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones we decided to do a first training with the backbones of our architecture completely frozen.
+ The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen.
 Only after these layers converged we unfreezed the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
 
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
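The two-stage schedule described in this hunk (freeze both backbones, warm up the projection layers, then unfreeze everything) can be sketched roughly as below. This is a generic PyTorch-style illustration of the idea, independent of how the project actually implements it; `vision_backbone`, `text_backbone`, `train_one_stage`, and the learning rates are hypothetical names and values.

```python
import torch


def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    """Enable or disable gradients for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag


def two_stage_training(model, train_one_stage, warmup_steps: int, finetune_steps: int):
    """Warm up the projection heads with frozen backbones, then fine-tune everything.

    `model` is assumed to expose `vision_backbone` and `text_backbone` attributes, and
    `train_one_stage(model, optimizer, steps)` stands in for the actual training loop.
    """
    # Stage 1: backbones frozen, only the randomly initialised projection layers learn.
    set_requires_grad(model.vision_backbone, False)
    set_requires_grad(model.text_backbone, False)
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    train_one_stage(model, opt, warmup_steps)

    # Stage 2: unfreeze the backbones and fine-tune all components at a lower learning rate.
    set_requires_grad(model.vision_backbone, True)
    set_requires_grad(model.text_backbone, True)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    train_one_stage(model, opt, finetune_steps)
```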
@@ -137,11 +137,11 @@ We got this idea from Nils' [video](https://youtu.be/RHXZKUr8qOY) on sentence em
 
 ### Effect of Our Edits
 
- The following picture showcase the effect that these edits have had on our evaluation loss:
+ The following picture showcases the effect that these edits have had on our evaluation loss:
 
 <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>
 
- The purple line is the original training without any of our improvements, you can see that we needed a lot of training steps to get the loss down.
+ The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
 Yellow line is the loss with the new optimizer, it is **striking** to see the time we save from this addition! Not only the loss improves, it
 also converges significantly faster! The blue line shows the results when
  fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy and you can see the
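Reading "fixed scaling" as keeping CLIP's logit scale (the temperature applied to the image-text similarity matrix) constant instead of learnable, a minimal sketch of the symmetric contrastive loss with a fixed scale could look like this. The function and the value 20.0 are illustrative assumptions, not the repository's code.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_embeds, text_embeds, logit_scale: float = 20.0):
    """Symmetric InfoNCE loss with a *fixed* logit scale (temperature).

    In the original CLIP, the logit scale is a learnable parameter; holding it at a
    constant is one way to read the "fixed scaling" change mentioned above.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix, scaled by the constant temperature.
    logits = logit_scale * image_embeds @ text_embeds.t()

    # Matching image-text pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2
```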
@@ -266,8 +266,8 @@ early 1900 and is part of the largest movie studios in Europe (Cinecittà). A se
 Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model
 finds difficult to count after three; this is a general limitation that is common to many models of this type.
 
- There are even more evident issues that we found in our model. Due to the unfiltered nature of our training data the model is exposed to many biases such as sexism, racism, stereotypes,
- slurs and gore that it might replicate without the awareness of their hurtful and harmful nature. Indeed, different BERT models - Italian ones included - are prone to create stereotyped
+ There are even more evident issues that we found in our model. Due to the unfiltered nature of our training data, the model is exposed to many biases such as sexism, racism, stereotypes,
+ slurs, and gore that it might replicate without the awareness of their hurtful and harmful nature. Indeed, different BERT models - Italian ones included - are prone to create stereotyped
 sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).
 While this is not something we intended, it certainly is something that we share the blame for since we were not able to avoid it.
 
 