vinid committed on
Commit 9ea982d
1 Parent(s): 3140e4f

updating the readme.md

Files changed (1)
  1. readme.md +13 -8
readme.md CHANGED
@@ -15,13 +15,13 @@ Thank you for this amazing opportunity, we hope you will like the results. :hear
 # Novel Contributions
 
 The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
-We indeed worked in a **low-resource setting**. The only datasets for captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
+We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
 To get competitive results we followed three strategies:
-1. more data;
+1. more and better data;
 2. better augmentations;
 3. better training.
 
-## More Data
+## More and Better Data
 
 We eventually had to deal with the fact that we do not have the same data that OpenAI had during the training of CLIP.
 Thus, we tried to add as much data as possible while keeping the data-quality as high as possible.
@@ -29,11 +29,13 @@ Thus, we tried to add as much data as possible while keeping the data-quality as
 We considered three main sources of data:
 
 + [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see,
-[Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
+[Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper, as they are
+the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
 However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
-On the other hand, this text is written in Italian and it is good quality.
-To prevent polluting the data with captions that are not meaningful, we used POS tagging
-on the data and removed all the captions that were composed for the 80% or more by PROPN.
+On the other hand, this text is written in Italian and it is of good quality.
+To prevent polluting the data with captions that are not meaningful, we used *POS tagging*
+on the text and removed all the captions in which 80% or more of the tokens were proper nouns (PROPN). This is a simple solution that allowed us to retain much
+of the dataset without introducing noise.
 
 Example: ....
 
@@ -124,9 +126,12 @@ the translated image labels might have had an impact on the final scores.
 
 ## Qualitative Evaluation
 
+Here we show some very interesting properties of the model. The first is its ability to detect colors, and the second is its (partial) counting
+ability. To our own surprise, many of the answers the model gives make a lot of sense!
+
 ### Colors
 
-### Numbers
+### Counting
 
 # Broader Outlook
 
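
The PROPN-based caption filter added in this revision is straightforward to reproduce. Below is a minimal sketch, assuming spaCy's Italian pipeline (`it_core_news_sm`) as the POS tagger and the 80% threshold mentioned above; the commit does not specify which tagger or tokenization was actually used, so these choices are illustrative.

```python
# Minimal sketch of the PROPN-based caption filter described above.
# Assumptions (not specified in the commit): spaCy's Italian pipeline is the
# POS tagger, and punctuation/whitespace tokens are ignored in the ratio.
import spacy

nlp = spacy.load("it_core_news_sm")  # hypothetical choice of Italian POS tagger


def keep_caption(caption: str, max_propn_ratio: float = 0.8) -> bool:
    """Keep a caption only if fewer than max_propn_ratio of its tokens are PROPN."""
    tokens = [t for t in nlp(caption) if not t.is_punct and not t.is_space]
    if not tokens:
        return False  # empty or punctuation-only captions are dropped
    propn_ratio = sum(t.pos_ == "PROPN" for t in tokens) / len(tokens)
    return propn_ratio < max_propn_ratio


captions = [
    "Roberto Baggio",                                  # 100% PROPN -> dropped
    "Un gatto arancione dorme su una sedia di legno",  # descriptive -> kept
]
print([c for c in captions if keep_caption(c)])
```

Computing the PROPN share over non-punctuation tokens drops name-only captions while keeping descriptive sentences that merely mention a name.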