Silvia Terragni committed
Commit c702c34
1 Parent(s): 1ed0e02

simple fixes on introduction.md

Files changed (1): introduction.md +7 -6
introduction.md CHANGED
@@ -29,10 +29,10 @@ is going to compute the similarity between the image and each label. The webapp
 
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
 We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
-To get competitive results we followed three strategies:
-1. more and better data;
-2. better augmentations;
-3. better training.
+To get competitive results we followed three strategies:
+1. more and better data;
+2. better augmentations;
+3. better training.
 
 ## More and Better Data
 
@@ -80,10 +80,10 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
 
 ### Backbone Freezing
 
-The ViT used by OpenAI was already trained on 400million images and it is the element in our architecture that probably required less training.
+The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required less training.
 The same is true for the BERT model we use. To allow the randomly initialized Re-projection Layers to warm up without messing with the tuned weights of the backbones we decided to do a first training with the backbones of our architecture completely frozen. Only after these layers converged we unfreezed the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
 
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="50%"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="80%"/>
 
 # Scientific Validity
 
@@ -166,6 +166,7 @@ And what about "two cats"?
 ### Complex Queries
 Have you ever seen "two brown horses"?
 <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_cavalli_marroni.png" alt="drawing" width="600"/>
+
 And finally, here's a very nice "cat on a chair"
 <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto_su_sedia.png" alt="drawing" width="600"/>
 
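
The first hunk's context line mentions that the demo "is going to compute the similarity between the image and each label". As a rough sketch of that zero-shot labeling step (not code from this repository; the function name, the 512-dimensional shared embedding space, and the use of PyTorch are all assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def rank_labels(image_emb: torch.Tensor, label_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate labels for one image by cosine similarity.

    image_emb: (d,) image embedding from the vision encoder.
    label_embs: (num_labels, d) text embeddings, one per candidate label.
    Returns label indices sorted from most to least similar.
    """
    image_emb = F.normalize(image_emb, dim=-1)    # unit norm, so the dot
    label_embs = F.normalize(label_embs, dim=-1)  # product is cosine similarity
    scores = label_embs @ image_emb               # (num_labels,)
    return scores.argsort(descending=True)

# Toy usage with random embeddings in a hypothetical 512-d shared space.
order = rank_labels(torch.randn(512), torch.randn(3, 512))
```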
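
The backbone-freezing paragraph in the second hunk describes a two-phase schedule: first train only the randomly initialized re-projection layers against frozen backbones, then unfreeze everything and fine-tune jointly. Below is a minimal PyTorch sketch of that schedule; the module names, layer sizes, and learning rates are placeholders, and CLIP-Italian itself was trained with JAX/Flax, so this only illustrates the idea:

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Stand-in for a ViT + Italian BERT dual encoder with re-projection layers."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.vision_backbone = nn.Linear(768, dim)  # placeholder for the pretrained ViT
        self.text_backbone = nn.Linear(768, dim)    # placeholder for Italian BERT
        self.vision_proj = nn.Linear(dim, dim)      # randomly initialized re-projection
        self.text_proj = nn.Linear(dim, dim)

def set_backbones_trainable(model: TwoTowerModel, trainable: bool) -> None:
    """Freeze or unfreeze the pretrained backbones; projections stay trainable."""
    for module in (model.vision_backbone, model.text_backbone):
        for param in module.parameters():
            param.requires_grad = trainable

model = TwoTowerModel()

# Phase 1: frozen backbones, warm up the projection layers only.
set_backbones_trainable(model, False)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ... train until the projection layers converge ...

# Phase 2: unfreeze everything and fine-tune all components together,
# typically at a lower learning rate to protect the pretrained weights.
set_backbones_trainable(model, True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```

The point of the first phase is that large random gradients from the fresh projection layers never flow into, and disturb, the already tuned backbone weights.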