Silvia Terragni committed
Commit c702c34
1 Parent(s): 1ed0e02

simple fixes on introduction.md

Files changed (1): introduction.md +7 -6
introduction.md CHANGED
@@ -29,10 +29,10 @@ is going to compute the similarity between the image and each label. The webapp
 
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
 We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
-To get competitive results we followed three strategies:
-1. more and better data;
-2. better augmentations;
-3. better training.
+To get competitive results we followed three strategies:
+1. more and better data;
+2. better augmentations;
+3. better training.
 
 ## More and Better Data
 
@@ -80,10 +80,10 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
 
 ### Backbone Freezing
 
-The ViT used by OpenAI was already trained on 400million images and it is the element in our architecture that probably required less training.
+The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required less training.
 The same is true for the BERT model we use. To allow the randomly initialized Re-projection Layers to warm up without messing with the tuned weights of the backbones we decided to do a first training with the backbones of our architecture completely frozen. Only after these layers converged we unfreezed the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
 
-<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="50%"/>
+<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="80%"/>
 
 # Scientific Validity
 
@@ -166,6 +166,7 @@ And what about "two cats"?
 ### Complex Queries
 Have you ever seen "two brown horses"?
 <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/due_cavalli_marroni.png" alt="drawing" width="600"/>
+
 And finally, here's a very nice "cat on a chair"
 <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/gatto_su_sedia.png" alt="drawing" width="600"/>
 
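
The first hunk's context line mentions that the demo "is going to compute the similarity between the image and each label". As a rough sketch of that zero-shot labeling step (not code from this repository; the function name, the 512-dimensional shared embedding space, and the use of PyTorch are all assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def rank_labels(image_emb: torch.Tensor, label_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate labels for one image by cosine similarity.

    image_emb: (d,) image embedding from the vision encoder.
    label_embs: (num_labels, d) text embeddings, one per candidate label.
    Returns label indices sorted from most to least similar.
    """
    image_emb = F.normalize(image_emb, dim=-1)    # unit norm, so the dot
    label_embs = F.normalize(label_embs, dim=-1)  # product is cosine similarity
    scores = label_embs @ image_emb               # (num_labels,)
    return scores.argsort(descending=True)

# Toy usage with random embeddings in a hypothetical 512-d shared space.
order = rank_labels(torch.randn(512), torch.randn(3, 512))
```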
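
The backbone-freezing paragraph in the second hunk describes a two-phase schedule: first train only the randomly initialized re-projection layers against frozen backbones, then unfreeze everything and fine-tune jointly. Below is a minimal PyTorch sketch of that schedule; the module names, layer sizes, and learning rates are placeholders, and CLIP-Italian itself was trained with JAX/Flax, so this only illustrates the idea:

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Stand-in for a ViT + Italian BERT dual encoder with re-projection layers."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.vision_backbone = nn.Linear(768, dim)  # placeholder for the pretrained ViT
        self.text_backbone = nn.Linear(768, dim)    # placeholder for Italian BERT
        self.vision_proj = nn.Linear(dim, dim)      # randomly initialized re-projection
        self.text_proj = nn.Linear(dim, dim)

def set_backbones_trainable(model: TwoTowerModel, trainable: bool) -> None:
    """Freeze or unfreeze the pretrained backbones; projections stay trainable."""
    for module in (model.vision_backbone, model.text_backbone):
        for param in module.parameters():
            param.requires_grad = trainable

model = TwoTowerModel()

# Phase 1: frozen backbones, warm up the projection layers only.
set_backbones_trainable(model, False)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ... train until the projection layers converge ...

# Phase 2: unfreeze everything and fine-tune all components together,
# typically at a lower learning rate to protect the pretrained weights.
set_backbones_trainable(model, True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```

The point of the first phase is that large random gradients from the fresh projection layers never flow into, and disturb, the already tuned backbone weights.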