Commit aacfe19 · Silvia Terragni committed
Parent(s): 6c1a3f9

reduce WIT description in readme

introduction.md: +2 −3
introduction.md CHANGED

```diff
@@ -66,12 +66,11 @@ We considered four main sources of data:
 [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions
 described in the paper as they are the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
 However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
-
-are still good (e.g., "running dog"). Thus, to prevent polluting the data with captions that are not meaningful, we used *POS tagging*
+To prevent polluting the data with captions that are not meaningful, we used *POS tagging*
 on the text and removed all the captions that were composed for the 80% or more by PROPN (around ~10% of the data). This is a simple solution that allowed us to retain much
 of the dataset, without introducing noise.
 
-Captions like
+Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey' * have been removed.
 
 + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
 MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
```
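For context, the filter this commit describes (drop any caption whose tokens are 80% or more PROPN) is easy to sketch. Below is a minimal, hypothetical implementation assuming spaCy with an Italian pipeline (`it_core_news_sm`); the model choice, function name, and punctuation handling are illustrative assumptions, not code from the repository:

```python
# Hypothetical sketch of the PROPN-ratio caption filter described in the diff.
# Assumes spaCy and its Italian pipeline (it_core_news_sm); not the repo's code.
import spacy

nlp = spacy.load("it_core_news_sm")  # Italian POS tagger (assumed model)

def is_mostly_propn(caption: str, threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the non-punctuation tokens are PROPN."""
    doc = nlp(caption)
    tokens = [t for t in doc if not t.is_punct and not t.is_space]
    if not tokens:
        return True  # a caption with no real tokens carries no signal either
    propn_ratio = sum(t.pos_ == "PROPN" for t in tokens) / len(tokens)
    return propn_ratio >= threshold

captions = ["Dora Riparia", "un cane che corre sul prato"]
kept = [c for c in captions if not is_mostly_propn(c)]
# "Dora Riparia" is tagged as proper nouns only and is dropped;
# the descriptive running-dog caption survives.
print(kept)
```

Filtering on the PROPN ratio rather than on exact name matches keeps mixed captions (a proper noun inside a descriptive sentence) while removing name-only entries, consistent with the ~10% of data the text reports dropping.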