                 STS Benchmark: Main English dataset

           Semantic Textual Similarity 2012-2017 Dataset

                  http://ixa2.si.ehu.eus/stswiki

STS Benchmark comprises a selection of the English datasets used in
the STS tasks organized by us in the context of SemEval between 2012
and 2017.

In order to provide a standard benchmark for comparing systems, we
organized the data into train, development and test splits. The
development split can be used to develop and tune the hyperparameters
of a system, and the test split should be used only once, to evaluate
the final system.

The benchmark comprises 8628 sentence pairs. This is the breakdown
according to genres and train-dev-test splits:

                train  dev test total 
        -----------------------------
        news     3299  500  500  4299
        caption  2000  625  625  3250
        forum     450  375  254  1079
        -----------------------------
        total    5749 1500 1379  8628

For reference, this is the breakdown according to the original names
and task years of the datasets:

  genre     file           years   train  dev test
  ------------------------------------------------
  news      MSRpar         2012     1000  250  250
  news      headlines      2013-16  1999  250  250 
  news      deft-news      2014      300    0    0
  captions  MSRvid         2012     1000  250  250
  captions  images         2014-15  1000  250  250
  captions  track5.en-en   2017        0  125  125
  forum     deft-forum     2014      450    0    0
  forum     answers-forums 2015        0  375    0
  forum     answer-answer  2016        0    0  254
  
In addition to the standard benchmark, we also include other datasets
(see readme.txt in "companion" directory).


Introduction
------------

Given two sentences of text, s1 and s2, systems need to compute how
similar s1 and s2 are, returning a similarity score between 0
(completely dissimilar) and 5 (semantically equivalent). The dataset
comprises naturally occurring pairs of sentences drawn from several
domains and genres, annotated by crowdsourcing. See the papers by
Agirre et al. (2012; 2013; 2014; 2015; 2016; 2017).
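
As an illustration of the expected output, below is a minimal sketch
of a trivial baseline (not part of this distribution) that scores a
sentence pair with a token-overlap (Jaccard) measure rescaled to the
0-5 range; the function name and the example sentences are made up for
illustration only:

  # baseline.py -- hypothetical token-overlap baseline (illustration only)
  def jaccard_score(s1, s2):
      """Return a similarity score in [0, 5] based on word overlap."""
      t1, t2 = set(s1.lower().split()), set(s2.lower().split())
      if not t1 or not t2:
          return 0.0
      jaccard = len(t1 & t2) / len(t1 | t2)   # overlap in [0, 1]
      return 5.0 * jaccard                    # rescale to the 0-5 range

  print(jaccard_score("A man is playing a guitar.",
                      "A man plays the guitar."))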

Format
------

Each file is encoded in UTF-8 (a superset of ASCII) and has the
following tab-separated fields:

  genre filename year score sentence1 sentence2

Optionally, there may be additional license-related fields after sentence2.
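
For instance, a file such as sts-train.txt can be loaded with a short
Python snippet like the one below (a sketch, not part of the official
tooling; the file name and the choice to keep only the first six
fields and ignore any trailing license fields are assumptions):

  import csv

  def read_sts(path):
      """Yield (genre, filename, year, score, sentence1, sentence2) tuples."""
      with open(path, encoding="utf-8") as f:
          for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
              genre, fname, year, score, s1, s2 = row[:6]  # drop optional license fields
              yield genre, fname, year, float(score), s1, s2

  pairs = list(read_sts("sts-train.txt"))
  print(len(pairs), pairs[0])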

NOTE: Given that some sentence pairs have been reused here and
elsewhere, systems should NOT use the following datasets to develop or
train their systems (see below for more details on datasets):

- Any of the datasets in Semeval STS competitions, including Semeval
  2014 task 1 (also known as SICK).
- The test part of MSR-Paraphrase (development and train are fine).
- The text of the videos in MSR-Video.


Evaluation script
-----------------

The official evaluation metric is the Pearson correlation coefficient.
Given a file called sys.txt containing the system scores (one per
line), you can run the evaluation script as follows:

$ perl correlation.pl sts-dev.txt sys.txt
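
The same figure can be computed in Python; the sketch below is an
assumption-laden alternative to correlation.pl (scipy is not part of
this distribution) and assumes gold scores in the fourth
tab-separated field of sts-dev.txt and one system score per line in
sys.txt, in the same order:

  from scipy.stats import pearsonr

  # Gold scores: 4th tab-separated field of the benchmark file.
  with open("sts-dev.txt", encoding="utf-8") as f:
      gold = [float(line.split("\t")[3]) for line in f]

  # System scores: one number per line, same order as the gold file.
  with open("sys.txt", encoding="utf-8") as f:
      sys_scores = [float(line) for line in f]

  r, _ = pearsonr(gold, sys_scores)
  print("Pearson r = %.4f" % r)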


Other
-----

Please check http://ixa2.si.ehu.eus/stswiki

We recommend that interested researchers join the (low traffic)
mailing list:

 http://groups.google.com/group/STS-semeval

Notes on datasets and licenses
------------------------------

If you use this data in your research, please cite (Agirre et al. 2017)
and the STS website: http://ixa2.si.ehu.eus/stswiki.

Please see LICENSE.txt
  

Organizers of tasks by year
---------------------------

2012 Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre

2013 Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre,
     Weiwei Guo

2014 Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
     Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau,
     Janyce Wiebe

2015 Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
     Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse
     Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce
     Wiebe

2016 Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor
     Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce
     Wiebe

2017 Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia
     Specia


References
----------

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre. Task 6: A
   Pilot on Semantic Textual Similarity. Proceedings of SemEval 2012.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei
   Guo. *SEM 2013 shared task: Semantic Textual
   Similarity. Proceedings of *SEM 2013.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
   Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau,
   Janyce Wiebe. Task 10: Multilingual Semantic Textual
   Similarity. Proceedings of SemEval 2014.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
    Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse
    Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce
    Wiebe. Task 2: Semantic Textual Similarity, English, Spanish and
    Pilot on Interpretability. Proceedings of SemEval 2015.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor
    Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce
    Wiebe. SemEval-2016 Task 1: Semantic Textual Similarity,
    Monolingual and Cross-Lingual Evaluation. Proceedings of SemEval
    2016.

Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia
    Specia. SemEval-2017 Task 1: Semantic Textual Similarity
    Multilingual and Crosslingual Focused Evaluation. Proceedings of
    SemEval 2017.

Clive Best, Erik van der Goot, Ken Blackler, Teofilo Garcia, and David
    Horby. 2005. Europe Media Monitor - System Description. In EUR
    Report 22173-En, Ispra, Italy.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier.
    Collecting Image Annotations Using Amazon's Mechanical Turk.  In
    Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and
    Language Data with Amazon's Mechanical Turk.