Silvia Terragni committed on
Commit aacfe19
1 Parent(s): 6c1a3f9

reduce WIT description in readme

Files changed (1)
  1. introduction.md +2 -3
introduction.md CHANGED
@@ -66,12 +66,11 @@ We considered four main sources of data:
  [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions
  described in the paper, as they are the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
  However, this kind of text, without more context, is not useful for learning a good mapping between images and captions.
- On the other hand, this text is written in Italian and is of good quality. We cannot simply remove all short captions, as some of them
- are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
+ To prevent polluting the data with captions that are not meaningful, we used *POS tagging*
  on the text and removed all the captions composed of 80% or more proper nouns (PROPN), around 10% of the data. This simple solution allowed us to retain much
  of the dataset without introducing noise.

- Captions like: *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.
+ Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.

  + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
  MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
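The change keeps the description of the PROPN filter while dropping the surrounding motivation. For readers who want to reproduce that filter, here is a minimal sketch. The commit does not say which POS tagger the project used, so the sketch assumes spaCy's Italian pipeline (`it_core_news_sm`); the 80% threshold comes from the text, while the choice to ignore punctuation and whitespace when computing the ratio is an assumption.

```python
# Minimal sketch of the PROPN-ratio filter described in the diff above.
# Assumptions: spaCy's Italian pipeline stands in for whatever tagger was
# actually used; punctuation/whitespace are excluded from the token count.
import spacy

nlp = spacy.load("it_core_news_sm")  # install: python -m spacy download it_core_news_sm

PROPN_THRESHOLD = 0.8  # drop captions that are 80% or more proper nouns


def keep_caption(text: str) -> bool:
    """Return False for captions made up almost entirely of proper nouns."""
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_punct and not t.is_space]
    if not tokens:
        return False
    propn_ratio = sum(t.pos_ == "PROPN" for t in tokens) / len(tokens)
    return propn_ratio < PROPN_THRESHOLD


captions = ["Dora Riparia", "Anna Maria Mozzoni", "un cane che corre nel parco"]
filtered = [c for c in captions if keep_caption(c)]
# The first two captions should be tagged as pure PROPN and get dropped;
# the descriptive caption ("a dog running in the park") survives.
```

For a corpus of this size, batching the tagging with `nlp.pipe(captions)` would be considerably faster than calling the pipeline one caption at a time.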