luisespinosa committed
Commit • ab65fea
1 Parent(s): eb73068
initial commit
- README.md +172 -0
- config.json +23 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- tf_model.h5 +3 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,172 @@
# TweetEval

This is the repository for the [_TweetEval_ benchmark (Findings of EMNLP 2020)](https://arxiv.org/pdf/2010.12421.pdf). _TweetEval_ consists of seven heterogeneous tasks in Twitter, all framed as multi-class tweet classification. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits.

# TweetEval: The Benchmark

These are the seven datasets of TweetEval, with their corresponding labels (more details about the format in the [datasets](https://github.com/cardiffnlp/tweeteval/tree/main/datasets) directory):

- **Emotion Recognition**, [SemEval 2018 (Emotion Recognition)](https://www.aclweb.org/anthology/S18-1001/) - 4 labels: `anger`, `joy`, `sadness`, `optimism`

- **Emoji Prediction**, [SemEval 2018 (Emoji Prediction)](https://www.aclweb.org/anthology/S18-1003.pdf) - 20 labels: :heart:, :heart_eyes:, :joy: `...` :evergreen_tree:, :camera:, :stuck_out_tongue_winking_eye:

- **Irony Detection**, [SemEval 2018 (Irony Detection)](https://www.aclweb.org/anthology/S18-1005.pdf) - 2 labels: `irony`, `not irony`

- **Hate Speech Detection**, [SemEval 2019 (Hateval)](https://www.aclweb.org/anthology/S19-2007.pdf) - 2 labels: `hateful`, `not hateful`

- **Offensive Language Identification**, [SemEval 2019 (OffensEval)](https://www.aclweb.org/anthology/S19-2010/) - 2 labels: `offensive`, `not offensive`

- **Sentiment Analysis**, [SemEval 2017 (Sentiment Analysis in Twitter)](https://www.aclweb.org/anthology/S17-2088/) - 3 labels: `positive`, `neutral`, `negative`

- **Stance Detection***, [SemEval 2016 (Detecting Stance in Tweets)](https://www.aclweb.org/anthology/S16-1003/) - 3 labels: `favour`, `neutral`, `against`

**Note***: For stance there are five different target topics (Abortion, Atheism, Climate change, Feminism and Hillary Clinton), each of which contains its own training, validation and test data.

# TweetEval: Leaderboard (Test set)

| Model | Emoji | Emotion | Hate | Irony | Offensive | Sentiment | Stance | ALL(TE) | Reference |
|-------------------|------:|--------:|-----:|------:|----------:|----------:|-------:|--------:|-----------|
| RoBERTa-Retrained | 31.4 | 78.5 | 52.3 | 61.7 | 80.5 | 72.6 | 69.3 | **65.2** | [TweetEval](https://arxiv.org/pdf/2010.12421.pdf) |
| RoBERTa-Base | 30.9 | 76.1 | 46.6 | 59.7 | 79.5 | 71.3 | 68.0 | 61.3 | [TweetEval](https://arxiv.org/pdf/2010.12421.pdf) |
| RoBERTa-Twitter | 29.3 | 72.0 | 49.9 | 65.4 | 77.1 | 69.1 | 66.7 | 61.0 | [TweetEval](https://arxiv.org/pdf/2010.12421.pdf) |
| FastText | 25.8 | 65.2 | 50.6 | 63.1 | 73.4 | 62.9 | 65.4 | 58.1 | [TweetEval](https://arxiv.org/pdf/2010.12421.pdf) |
| LSTM | 24.7 | 66.0 | 52.6 | 62.8 | 71.7 | 58.3 | 59.4 | 56.5 | [TweetEval](https://arxiv.org/pdf/2010.12421.pdf) |
| SVM | 29.3 | 64.7 | 36.7 | 61.7 | 52.3 | 62.9 | 67.3 | 53.5 | [TweetEval](https://arxiv.org/pdf/2010.12421.pdf) |

**Note***: Check the [reference paper](https://arxiv.org/pdf/2010.12421.pdf) for details on the official metrics for each task.

If you would like your results added to the leaderboard, you can either submit a pull request or send an email to any of the paper authors with your results and model predictions. Please also include a reference to a paper describing your approach.

# Evaluating your system

To evaluate your system, you simply need an individual predictions file for each of the tasks. The format of each predictions file should match the output examples in the predictions folder (one output label per line, as per the original test file). The predictions included as an example correspond to the best model evaluated in the paper, i.e., RoBERTa re-trained on Twitter (RoB-Rt in the paper).

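Before running the script, it can help to confirm that each predictions file lines up with its gold test file. A minimal sanity-check sketch (the paths shown are hypothetical placeholders; `evaluation_script.py` remains the authoritative evaluator):

```python
# Minimal sketch: verify a predictions file has one label per line and the
# same number of lines as the gold test file. Paths are hypothetical.

def check_predictions(gold_path: str, pred_path: str) -> None:
    with open(gold_path) as f:
        gold = [line.strip() for line in f if line.strip()]
    with open(pred_path) as f:
        preds = [line.strip() for line in f if line.strip()]
    if len(preds) != len(gold):
        raise ValueError(f"expected {len(gold)} predictions, found {len(preds)}")

# check_predictions("./datasets/sentiment/test_labels.txt",  # hypothetical layout
#                   "./predictions/sentiment.txt")
```
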
### Example usage

```bash
python evaluation_script.py
```

The script takes the TweetEval gold test labels and the predictions from the "predictions" folder by default, but you can change these to suit your needs via optional arguments.

### Optional arguments

Three optional arguments can be modified:

*--tweeteval_path*: Path to the TweetEval datasets. Default: *"./datasets/"*

*--predictions_path*: Path to the predictions directory. Default: *"./predictions/"*

*--task*: Use this to get detailed results for a single task *(emoji|emotion|hate|irony|offensive|sentiment|stance)*. Default: ""

Evaluation script sample usage from the terminal with parameters:

```bash
python evaluation_script.py --tweeteval_path ./datasets/ --predictions_path ./predictions/ --task emoji
```

(This command outputs the breakdown of results for the emoji prediction task only.)

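Under the hood, scoring a task reduces to comparing the gold and predicted label files with the task's official metric; per the reference paper, this is macro-averaged F1 for most tasks (a few tasks use task-specific variants). A minimal sketch using scikit-learn, again with hypothetical paths:

```python
# Sketch of scoring one task with macro-averaged F1, the official metric for
# most TweetEval tasks per the paper. Paths are hypothetical placeholders.
from sklearn.metrics import f1_score

def score_task(gold_path: str, pred_path: str) -> float:
    with open(gold_path) as f:
        gold = [line.strip() for line in f if line.strip()]
    with open(pred_path) as f:
        preds = [line.strip() for line in f if line.strip()]
    return f1_score(gold, preds, average="macro")

# print(score_task("./datasets/emoji/test_labels.txt",  # hypothetical paths
#                  "./predictions/emoji.txt"))
```
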
# Pre-trained models and code

Coming soon! Here we will release all the Twitter-trained RoBERTa models included in the paper, together with code to evaluate any pre-trained language model on _TweetEval_. In the meantime, the checkpoint in this repository can already be loaded with standard `transformers` APIs, as sketched below.

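A minimal loading sketch; the model ID below is an assumption taken from `_name_or_path` in `config.json`, so substitute the actual Hub ID if it differs:

```python
# Minimal loading sketch. MODEL is an assumed placeholder derived from
# config.json's "_name_or_path"; replace with the actual Hub model ID.
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "tweeteval/roberta-base-rt"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)  # config lists RobertaForMaskedLM

inputs = tokenizer("TweetEval is a <mask> for tweet classification.", return_tensors="pt")
outputs = model(**inputs)  # masked-LM logits over the vocabulary
```

Note that the saved architecture is a masked language model, so for any of the TweetEval classification tasks the checkpoint would first need task-specific fine-tuning.
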
# Citing TweetEval

If you use TweetEval in your research, please use the following `bib` entry to cite the [reference paper](https://arxiv.org/pdf/2010.12421.pdf):

```
@inproceedings{barbieri2020tweeteval,
  title={{TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification}},
  author={Barbieri, Francesco and Camacho-Collados, Jose and Espinosa-Anke, Luis and Neves, Leonardo},
  booktitle={Proceedings of Findings of EMNLP},
  year={2020}
}
```

# License

TweetEval is released without any restrictions, but restrictions may apply to the individual tasks (which are derived from existing datasets) or to Twitter (the main data source). We refer users to the original licenses accompanying each dataset and to Twitter regulations.

# Citing TweetEval datasets

If you use any of the TweetEval datasets, please cite their original publications:

#### Emotion Recognition:
```
@inproceedings{mohammad2018semeval,
  title={SemEval-2018 Task 1: Affect in tweets},
  author={Mohammad, Saif and Bravo-Marquez, Felipe and Salameh, Mohammad and Kiritchenko, Svetlana},
  booktitle={Proceedings of the 12th International Workshop on Semantic Evaluation},
  pages={1--17},
  year={2018}
}
```

#### Emoji Prediction:
```
@inproceedings{barbieri2018semeval,
  title={SemEval-2018 Task 2: Multilingual emoji prediction},
  author={Barbieri, Francesco and Camacho-Collados, Jose and Ronzano, Francesco and Espinosa-Anke, Luis and Ballesteros, Miguel and Basile, Valerio and Patti, Viviana and Saggion, Horacio},
  booktitle={Proceedings of the 12th International Workshop on Semantic Evaluation},
  pages={24--33},
  year={2018}
}
```

#### Irony Detection:
```
@inproceedings{van2018semeval,
  title={SemEval-2018 Task 3: Irony detection in English tweets},
  author={Van Hee, Cynthia and Lefever, Els and Hoste, V{\'e}ronique},
  booktitle={Proceedings of the 12th International Workshop on Semantic Evaluation},
  pages={39--50},
  year={2018}
}
```

#### Hate Speech Detection:
```
@inproceedings{basile-etal-2019-semeval,
  title={{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter},
  author={Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and Rangel Pardo, Francisco Manuel and Rosso, Paolo and Sanguinetti, Manuela},
  booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation},
  address={Minneapolis, Minnesota, USA},
  publisher={Association for Computational Linguistics},
  url={https://www.aclweb.org/anthology/S19-2007},
  doi={10.18653/v1/S19-2007},
  pages={54--63},
  year={2019}
}
```

#### Offensive Language Identification:
```
@inproceedings{zampieri2019semeval,
  title={SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)},
  author={Zampieri, Marcos and Malmasi, Shervin and Nakov, Preslav and Rosenthal, Sara and Farra, Noura and Kumar, Ritesh},
  booktitle={Proceedings of the 13th International Workshop on Semantic Evaluation},
  pages={75--86},
  year={2019}
}
```

#### Sentiment Analysis:
```
@inproceedings{rosenthal2017semeval,
  title={SemEval-2017 Task 4: Sentiment analysis in Twitter},
  author={Rosenthal, Sara and Farra, Noura and Nakov, Preslav},
  booktitle={Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
  pages={502--518},
  year={2017}
}
```

#### Stance Detection:
```
@inproceedings{mohammad2016semeval,
  title={SemEval-2016 Task 6: Detecting stance in tweets},
  author={Mohammad, Saif and Kiritchenko, Svetlana and Sobhani, Parinaz and Zhu, Xiaodan and Cherry, Colin},
  booktitle={Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)},
  pages={31--41},
  year={2016}
}
```

config.json
ADDED
@@ -0,0 +1,23 @@
{
  "_name_or_path": "tweeteval/roberta-base-rt/",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}
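The configuration above describes a standard roberta-base sized model (12 layers, 12 attention heads, hidden size 768). As a sketch, it can be inspected programmatically; the model ID is the same assumed placeholder as above:

```python
from transformers import AutoConfig

# Assumed placeholder ID from "_name_or_path"; substitute the real Hub ID.
config = AutoConfig.from_pretrained("tweeteval/roberta-base-rt")
print(config.model_type, config.num_hidden_layers, config.hidden_size)  # roberta 12 768
```
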
merges.txt
ADDED
The diff for this file is too large to render.
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3c9cc492ef7a1a9cd3e08dbabce6e8eef942e5d355ffdab19dbfbdd226eb7ae1
size 501204462
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2150e429c436d73a051d7f8a0da6459f728e070ba5dc9f86d5d96912e9a55e0b
size 498845616
vocab.json
ADDED
The diff for this file is too large to render.