4rtemi5 committed on
Commit 7377c04
1 Parent(s): 3a77c4a

fixing typos

Files changed (1)
  1. introduction.md +12 -13
introduction.md CHANGED
@@ -71,7 +71,7 @@ MSCOCO dataset and have been translated with Microsoft Translator. The 2017 vers
71
 
72
  + [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
73
  the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). There are more than 3 million image-caption pairs in
74
- this dataset and these have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
75
  could not retrieve them all. We then translated the captions to Italian and were able to collect
76
  a dataset with 700K translated captions.
77
 
@@ -83,14 +83,14 @@ Each photo comes along with an Italian caption.
83
  ### A Note on Translations
84
 
85
  Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
86
- reason of this choice. With the few images (wrt OpenAI) that we have, we cannot risk polluting our own data. CC is a great resource
87
  but the captions have to be handled with care. We translated 700K captions and evaluated their quality:
88
 
89
  Three of us looked at a sample of 100 of the translations and rated them with scores from 1 to 4.
90
- The meaning of the value is as follows: 1, the sentence has lost is meaning or it's not possible to understand it; 2, it is possible to get the idea
91
- but there something wrong; 3, good, however a native speaker might complain about some translations; 4, good translation.
92
 
93
- The average score was of 3.78 and the three annotators had an inter-rater agreement - computed with [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal
94
  weighting - of 0.858 (great agreement!).
95
 
96
  | English Captions | Italian Captions |
@@ -99,7 +99,6 @@ weighting - of 0.858 (great agreement!).
99
  | person walking down the aisle | persona che cammina lungo la navata |
100
  | popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
101
 
102
- \t\t\t
103
  We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so
104
  that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
105
 
@@ -113,7 +112,7 @@ While we would have liked to have augmentations for the captions as well, after
113
 
114
  After different trials, we realized that the usual way of training this model was
115
  not sufficient to get good results. We thus modified three different parts of the
116
- training pipeline: the optimizer, the training with frozen components and the logit_scale parameter.
117
 
118
  ### Optimizer
119
 
@@ -124,9 +123,9 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
124
 
125
  ### Backbone Freezing
126
 
127
- The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required the least training.
128
  The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen.
129
- Only after these layers converged we unfreezed the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
130
 
131
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
132
 
@@ -146,14 +145,14 @@ The following picture showcases the effect that these edits have had on our eval
146
  The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
147
  The yellow line is the loss with the new optimizer: it is **striking** to see how much time we save with this addition! Not only does the loss improve, it
148
  also converges significantly faster! The blue line shows the results when
149
- fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy and you can see the
150
  results in the light blue curve. Nonetheless, as is common in deep learning, having more data played a big role and was another key element
151
  to reduce the loss.
152
 
153
 
154
  # Scientific Validity
155
 
156
- We split this section in two: we first provide a quantitative evaluation to ensure that what we are learning is really good.
157
  We then show some qualitative examples of images found by the model. **All the code we have written** to run our validation experiments (in combination with
158
  code made available by Nils Reimers and by the authors of the original CLIP) is available.
159
 
@@ -195,7 +194,7 @@ described by the original caption. As evaluation metrics we use the MRR@K.
195
  | MRR@5 | **0.5039** | 0.3957|
196
  | MRR@10 | **0.5204** | 0.4129|
197
 
198
- _If the table above doesn not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
199
 
200
  It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
201
  on 400 million images (and some of them might have been from MSCOCO).
@@ -238,7 +237,7 @@ Look at the following - slightly cherry picked - examples:
238
  Here's "a yellow flower"
239
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>
240
 
241
- And here's "a blu flower"
242
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>
243
 
244
  ### Counting
 
71
 
72
  + [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
73
  the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). There are more than 3 million image-caption pairs in
74
+ this dataset that have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
75
  could not retrieve them all. We then translated the captions to Italian and were able to collect
76
  a dataset with 700K translated captions.
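
As a rough illustration of this collection step (not the project's actual script; the TSV name, output layout, and limit below are assumptions), here is a sketch that reads Conceptual Captions' caption/URL pairs and simply skips URLs that can no longer be retrieved:

```python
import csv
import os
import requests

def download_conceptual_captions(tsv_path="cc_train.tsv", out_dir="cc_images", limit=1000):
    """Read caption/URL pairs and keep only the images that can still be fetched.

    The file name, output directory, and limit are illustrative placeholders.
    """
    os.makedirs(out_dir, exist_ok=True)
    kept = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f, delimiter="\t")):
            if i >= limit:
                break
            caption, url = row[0], row[1]
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # dead link: drop this caption-image pair
            path = os.path.join(out_dir, f"{i}.jpg")
            with open(path, "wb") as img_file:
                img_file.write(resp.content)
            kept.append((path, caption))  # captions still need translation to Italian
    return kept
```
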
77
 
 
83
  ### A Note on Translations
84
 
85
  Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
86
+ reason for this choice. With the few images we have (compared to OpenAI), we cannot risk polluting our own data. CC is a great resource,
87
  but the captions have to be handled with care. We translated 700K captions and evaluated their quality:
88
 
89
  Three of us looked at a sample of 100 of the translations and rated them with scores from 1 to 4.
90
+ The meaning of each value is as follows: 1, the sentence has lost its meaning or cannot be understood; 2, it is possible to get the idea
91
+ but there is something wrong; 3, good, though a native speaker might complain about the translation; 4, good translation.
92
 
93
+ The average score was 3.78, and the three annotators had an inter-rater agreement - computed with [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal
94
  weighting - of 0.858 (great agreement!).
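
For readers who want to reproduce this kind of agreement score, here is a minimal sketch of a weighted Gwet coefficient (the weighted variant of AC1, often called AC2), assuming three raters, no missing ratings, and linear distance weights as a stand-in for the ordinal weighting mentioned above; the scores fed in below are random placeholders, not our annotations:

```python
import numpy as np

def gwet_weighted_ac(ratings, q, weights=None):
    """Weighted Gwet agreement coefficient (AC1 with weights, a.k.a. AC2).

    ratings: (n_items, n_raters) array of categories coded 1..q, no missing values.
    weights: (q, q) agreement weights; defaults to linear distance weights,
             used here as a stand-in for ordinal weighting.
    """
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    if weights is None:
        idx = np.arange(q)
        weights = 1.0 - np.abs(idx[:, None] - idx[None, :]) / (q - 1)

    # r_ik: number of raters assigning item i to category k
    r_ik = np.stack([(ratings == c).sum(axis=1) for c in range(1, q + 1)], axis=1)

    # Weighted observed agreement p_a
    r_star = r_ik @ weights.T  # r*_ik = sum_l w_kl * r_il
    p_a = ((r_ik * (r_star - 1)).sum(axis=1) / (n_raters * (n_raters - 1))).mean()

    # Chance agreement p_e
    pi_k = r_ik.mean(axis=0) / n_raters
    p_e = weights.sum() / (q * (q - 1)) * (pi_k * (1 - pi_k)).sum()

    return (p_a - p_e) / (1 - p_e)

# Toy example: 3 annotators rating 100 translations on the 1-4 scale
rng = np.random.default_rng(0)
scores = rng.integers(3, 5, size=(100, 3))  # placeholder ratings, not our data
print(round(gwet_weighted_ac(scores, q=4), 3))
```
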
95
 
96
  | English Captions | Italian Captions |
 
99
  | person walking down the aisle | persona che cammina lungo la navata |
100
  | popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |
101
 
 
102
  We know that we annotated our own data; in the spirit of fairness we also share the annotations and the captions so
103
  that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).
104
 
 
112
 
113
  After different trials, we realized that the usual way of training this model was
114
  not sufficient to get good results. We thus modified three different parts of the
115
+ training pipeline: the optimizer, the training with frozen components, and the fixed logit_scale parameter.
116
 
117
  ### Optimizer
118
 
 
123
 
124
  ### Backbone Freezing
125
 
126
+ The ViT used by OpenAI was already trained on 400 million images, and it is the element in our architecture that probably requires the least amount of training.
127
  The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen.
128
+ Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
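
A minimal sketch of this two-stage schedule, with toy stand-in modules rather than the real ViT/BERT backbones and projection layers (this is not the project's actual training code, and the learning rates are illustrative):

```python
import torch
from torch import nn

# Toy stand-ins for the image/text backbones and the randomly initialised
# re-projection layers; only the two-stage freezing schedule is illustrated.
image_backbone, text_backbone = nn.Linear(768, 768), nn.Linear(768, 768)
image_proj, text_proj = nn.Linear(768, 512), nn.Linear(768, 512)

def set_backbones_trainable(trainable: bool):
    for module in (image_backbone, text_backbone):
        for p in module.parameters():
            p.requires_grad_(trainable)

# Stage 1: backbones frozen, only the projection heads are updated.
set_backbones_trainable(False)
proj_params = list(image_proj.parameters()) + list(text_proj.parameters())
optimizer = torch.optim.AdamW(proj_params, lr=1e-4)
# ... train until the projection layers have converged ...

# Stage 2: unfreeze everything and fine-tune the whole model.
set_backbones_trainable(True)
all_params = proj_params + list(image_backbone.parameters()) + list(text_backbone.parameters())
optimizer = torch.optim.AdamW(all_params, lr=1e-5)
# ... continue training on the full model ...
```
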
129
 
130
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
131
 
 
145
  The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
146
  The yellow line is the loss with the new optimizer: it is **striking** to see how much time we save with this addition! Not only does the loss improve, it
147
  also converges significantly faster! The blue line shows the results when
148
+ fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy, and you can see the
149
  results in the light blue curve. Nonetheless, as is common in deep learning, having more data played a big role and was another key element
150
  to reduce the loss.
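
For context, the scaling in question is the logit_scale temperature that multiplies the image-text similarity matrix in CLIP-style contrastive training; here is a minimal sketch of the symmetric loss with a fixed, non-learnable scale (the value 20.0 is an assumption for illustration, not necessarily the value used in this project):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, logit_scale=20.0):
    """Symmetric contrastive loss with a fixed (non-learnable) logit scale.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```
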
151
 
152
 
153
  # Scientific Validity
154
 
155
+ We split this section into two parts: we first provide a quantitative evaluation to ensure that what we are learning is in fact good.
156
  We then show some qualitative examples of images found by the model. **All the code we have written** to run our validation experiments (in combination with
157
  code made available by Nils Reimers and by the authors of the original CLIP) is available.
158
 
 
194
  | MRR@5 | **0.5039** | 0.3957|
195
  | MRR@10 | **0.5204** | 0.4129|
196
 
197
+ _If the table above does not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._
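
For reference, MRR@K averages the reciprocal rank of the correct image over all queries, counting 0 when it does not appear in the top K; a minimal sketch, assuming a 1:1 caption-image correspondence and using random scores as placeholder data:

```python
import numpy as np

def mrr_at_k(similarity, k):
    """similarity: (n_queries, n_images) scores; the correct image for
    query i is assumed to be image i (1:1 caption-image pairs)."""
    n = similarity.shape[0]
    ranking = np.argsort(-similarity, axis=1)  # best-scoring image first
    rr = []
    for i in range(n):
        hits = np.where(ranking[i, :k] == i)[0]
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))

# Placeholder usage with random scores
scores = np.random.default_rng(0).normal(size=(1000, 1000))
print(mrr_at_k(scores, k=10))
```
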
198
 
199
  It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
200
  on 400 million images (and some of them might have been from MSCOCO).
 
237
  Here's "a yellow flower"
238
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>
239
 
240
+ And here's "a blue flower"
241
  <img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>
242
 
243
  ### Counting