Silvia Terragni committed on
Commit aacfe19
1 Parent(s): 6c1a3f9

reduce WIT description in readme

Files changed (1)
  1. introduction.md +2 -3
introduction.md CHANGED
@@ -66,12 +66,11 @@ We considered four main sources of data:
  [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions
  described in the paper, as they are the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
  However, this kind of text, without more context, is not useful for learning a good mapping between images and captions.
- On the other hand, this text is written in Italian and is of good quality. We cannot simply remove all short captions, as some of them
- are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
+ To prevent polluting the data with captions that are not meaningful, we used *POS tagging*
  on the text and removed all the captions composed of 80% or more proper nouns (PROPN), around 10% of the data. This simple solution allowed us to retain much
  of the dataset without introducing noise.

- Captions like: *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.
+ Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.

  + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
  MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
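The change keeps the description of the PROPN filter while dropping the surrounding motivation. For readers who want to reproduce that filter, here is a minimal sketch. The commit does not say which POS tagger the project used, so the sketch assumes spaCy's Italian pipeline (`it_core_news_sm`); the 80% threshold comes from the text, while the choice to ignore punctuation and whitespace when computing the ratio is an assumption.

```python
# Minimal sketch of the PROPN-ratio filter described in the diff above.
# Assumptions: spaCy's Italian pipeline stands in for whatever tagger was
# actually used; punctuation/whitespace are excluded from the token count.
import spacy

nlp = spacy.load("it_core_news_sm")  # install: python -m spacy download it_core_news_sm

PROPN_THRESHOLD = 0.8  # drop captions that are 80% or more proper nouns


def keep_caption(text: str) -> bool:
    """Return False for captions made up almost entirely of proper nouns."""
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_punct and not t.is_space]
    if not tokens:
        return False
    propn_ratio = sum(t.pos_ == "PROPN" for t in tokens) / len(tokens)
    return propn_ratio < PROPN_THRESHOLD


captions = ["Dora Riparia", "Anna Maria Mozzoni", "un cane che corre nel parco"]
filtered = [c for c in captions if keep_caption(c)]
# The first two captions should be tagged as pure PROPN and get dropped;
# the descriptive caption ("a dog running in the park") survives.
```

For a corpus of this size, batching the tagging with `nlp.pipe(captions)` would be considerably faster than calling the pipeline one caption at a time.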