                 STS Benchmark: Main English dataset

            Semantic Textual Similarity 2012-2017 Dataset

                   http://ixa2.si.ehu.eus/stswiki


STS Benchmark comprises a selection of the English datasets used in the
STS tasks organized by us in the context of SemEval between 2012 and
2017. In order to provide a standard benchmark for comparing systems, we
organized it into train, development and test splits. The development
part can be used to develop and tune the hyperparameters of a system,
and the test part should be used only once, for the final system.

The benchmark comprises 8628 sentence pairs. This is the breakdown
according to genres and train-dev-test splits:

            train   dev  test  total
    ---------------------------------
    news     3299   500   500   4299
    caption  2000   625   525   3250
    forum     450   375   254   1079
    ---------------------------------
    total    5749  1500  1379   8628

For reference, this is the breakdown according to the original names and
task years of the datasets:

    genre     file            years    train  dev  test
    ------------------------------------------------------
    news      MSRpar          2012      1000  250   250
    news      headlines       2013-16   1999  250   250
    news      deft-news       2014       300    0     0
    captions  MSRvid          2012      1000  250   250
    captions  images          2014-15   1000  250   250
    captions  track5.en-en    2017         0  125   125
    forum     deft-forum      2014       450    0     0
    forum     answers-forums  2015         0  375     0
    forum     answer-answer   2016         0    0   254

In addition to the standard benchmark, we also include other datasets
(see readme.txt in the "companion" directory).


Introduction
------------

Given two sentences of text, s1 and s2, systems need to compute how
similar s1 and s2 are, returning a similarity score between 0 and 5. The
dataset comprises naturally occurring pairs of sentences drawn from
several domains and genres, annotated by crowdsourcing. See the papers
by Agirre et al. (2012; 2013; 2014; 2015; 2016; 2017).
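To make the task interface concrete, here is a deliberately naive toy
baseline (our own illustration, not a method from the benchmark or the
SemEval papers): it scores a sentence pair by Jaccard token overlap,
scaled to the 0-5 range that systems are expected to output.

```python
def jaccard_similarity_0_5(s1, s2):
    """Toy STS baseline: Jaccard overlap of lowercased token sets,
    scaled to the benchmark's 0-5 similarity range."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 5.0  # two empty sentences: treat as identical
    return 5.0 * len(t1 & t2) / len(t1 | t2)
```

Identical sentences score 5.0, disjoint ones 0.0; real systems are
evaluated by how well their scores correlate with the gold annotations.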
Format
------

Each file is encoded in UTF-8 (a superset of ASCII) and has the
following tab-separated fields:

    genre filename year score sentence1 sentence2

Optionally, there may be some license-related fields after sentence2.

NOTE: Given that some sentence pairs have been reused here and
elsewhere, systems should NOT use the following datasets to develop or
train their systems (see below for more details on the datasets):

  - Any of the datasets in the SemEval STS competitions, including
    SemEval 2014 Task 1 (also known as SICK).
  - The test part of MSR-Paraphrase (the development and train parts
    are fine).
  - The text of the videos in MSR-Video.


Evaluation script
-----------------

The official evaluation metric is the Pearson correlation coefficient.
Given an output file comprising the system scores (one per line) in a
file called sys.txt, you can use the evaluation script as follows:

    $ perl correlation.pl sts-dev.txt sys.txt


Other
-----

Please check http://ixa2.si.ehu.eus/stswiki

We recommend that interested researchers join the (low traffic) mailing
list:

    http://groups.google.com/group/STS-semeval


Notes on datasets and licenses
------------------------------

If using this data in your research please cite (Agirre et al. 2017)
and the STS website: http://ixa2.si.ehu.eus/stswiki.
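The Perl script above is the official scorer. For quick sanity checks,
the file format and the metric can also be reproduced in a few lines of
Python. The sketch below is an illustrative reimplementation under our
own assumptions, not the official scorer: it reads gold scores from the
fourth tab-separated field described under Format, reads one system
score per line from a sys.txt-style file, and computes the Pearson
correlation from first principles.

```python
import csv
import math

def load_gold_scores(path):
    """Gold score is the 4th tab-separated field:
    genre, filename, year, score, sentence1, sentence2 [, license info]."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: sentences may contain quote characters that must
        # not be interpreted as CSV quoting.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [float(row[3]) for row in reader]

def load_system_scores(path):
    """System output: one similarity score per line, same order as gold."""
    with open(path, encoding="utf-8") as f:
        return [float(line) for line in f if line.strip()]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Usage would mirror the Perl invocation, e.g.
`pearson(load_gold_scores("sts-dev.txt"), load_system_scores("sys.txt"))`.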
Please see LICENSE.txt


Organizers of tasks by year
---------------------------

2012 Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre
2013 Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre,
     Weiwei Guo
2014 Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
     Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau,
     Janyce Wiebe
2015 Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab,
     Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse
     Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, Janyce Wiebe
2016 Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor
     Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce Wiebe
2017 Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio,
     Lucia Specia


References
----------

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre. Task 6: A
Pilot on Semantic Textual Similarity. Proceedings of SemEval 2012.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo.
*SEM 2013 Shared Task: Semantic Textual Similarity. Proceedings of *SEM
2013.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor
Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, Janyce Wiebe.
Task 10: Multilingual Semantic Textual Similarity. Proceedings of
SemEval 2014.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor
Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar,
Rada Mihalcea, German Rigau, Larraitz Uria, Janyce Wiebe. Task 2:
Semantic Textual Similarity, English, Spanish and Pilot on
Interpretability. Proceedings of SemEval 2015.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor
Gonzalez-Agirre, Rada Mihalcea, German Rigau, Janyce Wiebe.
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and
Cross-Lingual Evaluation. Proceedings of SemEval 2016.

Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia Specia.
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and
Crosslingual Focused Evaluation. Proceedings of SemEval 2017.

Clive Best, Erik van der Goot, Ken Blackler, Teofilo Garcia, and David
Horby. 2005. Europe Media Monitor - System Description. EUR Report
22173-En, Ispra, Italy.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier.
Collecting Image Annotations Using Amazon's Mechanical Turk.
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and
Language Data with Amazon's Mechanical Turk.