---

language: "en"

license: "cc-by-sa-4.0"

tags:
- text-classification
- hate-speech

widget:
- text: "Gay is okay."
---




# roberta-base-frenk-hate

Text classification model based on [`roberta-base`](https://huggingface.co/roberta-base) and fine-tuned on the [FRENK dataset](https://www.clarin.si/repository/xmlui/handle/11356/1433), which comprises LGBT and migrant hate speech. Only the English subset of the data was used for fine-tuning, and the dataset was relabeled for binary classification (offensive or acceptable).

## Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`. A brief hyperparameter optimisation was carried out beforehand, and the presumed optimal hyperparameters are:

```python
model_args = {
        "num_train_epochs": 6,
        "learning_rate": 3e-6,
        "train_batch_size": 69}
```

## Performance

The same pipeline was run with two other transformer models and `fasttext` for comparison. Accuracy and macro F1 score were recorded for each of the 6 fine-tuning sessions and analyzed afterwards.

| model | average accuracy | average macro F1 |
|---|---|---|
| roberta-base-frenk-hate | 0.7915 | 0.7785 |
| xlm-roberta-large | 0.7904 | 0.77876 |
| xlm-roberta-base | 0.7577 | 0.7402 |
| fasttext | 0.725 | 0.707 |
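
As an illustration of the two metrics reported above, the following minimal sketch computes accuracy and macro F1 for a binary task in plain Python. The labels and predictions are hypothetical and are not taken from the FRENK evaluation; the function names `accuracy` and `macro_f1` are chosen for this example only and are not part of the original pipeline.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold labels and predictions (assumed: 1 = offensive, 0 = acceptable).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}, macro F1={f1:.4f}")
```

Macro F1 weights both classes equally regardless of how many examples each has, which is why it can rank models differently from accuracy.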



p-values were also calculated from the recorded accuracies and macro F1 scores.

Comparison with `xlm-roberta-base`:

| test | accuracy p-value | macro F1 p-value|
| --- | --- | --- |
|Wilcoxon|0.00781|0.00781|
|Mann-Whitney U test|0.00108|0.00108|
|Student t-test | 1.35e-08 | 1.05e-07|


Comparison with `xlm-roberta-large` yielded inconclusive results. `roberta-base` has an average accuracy of 0.7915, while `xlm-roberta-large` has an average accuracy of 0.7904. On macro F1, `roberta-base` actually has a slightly lower average than `xlm-roberta-large`: 0.77852 vs. 0.77876. The same statistical tests were performed under the hypothesis that `roberta-base` has the higher metrics, and the results are given below.

| test | accuracy p-value | macro F1 p-value|
| --- | --- | --- |
|Wilcoxon|0.188|0.406|
|Mann-Whitney U test|0.375|0.649|
|Student t-test | 0.681| 0.934|

With the reversed hypothesis (i.e., that `xlm-roberta-large` has the higher metrics), the Wilcoxon p-value for macro F1 scores reaches 0.656 and the Mann-Whitney p-value is 0.399, while the Student t-test p-value stays the same. It was therefore concluded that the performance of the two models is not statistically significantly different.
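
The three tests named in this section are available in `scipy.stats`. The sketch below shows how such a comparison could be run; it is not the original analysis script. The per-run scores are hypothetical placeholders, and the pairing of runs for the Wilcoxon signed-rank test is an assumption of this example.

```python
from scipy import stats

# Hypothetical per-run accuracies for two models (NOT the recorded results).
model_a = [0.790, 0.795, 0.788, 0.793, 0.791, 0.789]
model_b = [0.757, 0.760, 0.752, 0.761, 0.754, 0.758]

# One-sided tests of the hypothesis that model_a scores higher than model_b.
# Wilcoxon is a paired test: run i of model_a is compared with run i of model_b.
wilcoxon_p = stats.wilcoxon(model_a, model_b, alternative="greater").pvalue
mannwhitney_p = stats.mannwhitneyu(model_a, model_b, alternative="greater").pvalue

# Two-sided t-test: swapping the two samples leaves the p-value unchanged.
ttest_p = stats.ttest_ind(model_a, model_b).pvalue
ttest_p_swapped = stats.ttest_ind(model_b, model_a).pvalue

print(f"Wilcoxon:       p = {wilcoxon_p:.5f}")
print(f"Mann-Whitney U: p = {mannwhitney_p:.5f}")
print(f"Student t-test: p = {ttest_p:.2e}")
```

Reversing the hypothesis for the one-sided tests amounts to swapping the two samples (or passing `alternative="less"`), whereas the two-sided t-test is symmetric in its arguments.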

## Use examples

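```python
from simpletransformers.classification import ClassificationModel

# Training hyperparameters from the fine-tuning run.
model_args = {
    "num_train_epochs": 6,
    "learning_rate": 3e-6,
    "train_batch_size": 69,
}

# Load the fine-tuned model from the Hugging Face Hub.
# Set use_cuda=False if no GPU is available.
model = ClassificationModel(
    "roberta",
    "5roop/roberta-base-frenk-hate",
    use_cuda=True,
    args=model_args,
)

# predict() returns the predicted labels and the raw model outputs.
predictions, logit_output = model.predict(
    ["Build the wall", "Build the wall of trust"]
)
predictions
### Output:
### array([1, 0])
```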
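
If class probabilities are needed rather than hard labels, the raw outputs returned by `predict` can be passed through a softmax. The following is a minimal sketch with NumPy; the logit values are hypothetical stand-ins for `logit_output`, the helper name `logits_to_probabilities` is introduced here for illustration, and the reading of label 1 as offensive and 0 as acceptable is an assumption based on the example output above.

```python
import numpy as np


def logits_to_probabilities(logits):
    """Apply a row-wise softmax to an array of shape (n_texts, n_classes)."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical logits standing in for `logit_output` (one row per input text).
example_logits = [[-1.2, 1.5], [2.0, -1.8]]

probabilities = logits_to_probabilities(example_logits)
predicted_labels = probabilities.argmax(axis=1)
print(probabilities)
print(predicted_labels)
```

The arg-max of each probability row matches the arg-max of the corresponding logits, so the hard labels are unaffected by this conversion.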

## Citation


If you use the model, please cite the following paper on which the original model is based:
```
@article{DBLP:journals/corr/abs-1907-11692,
  author    = {Yinhan Liu and
               Myle Ott and
               Naman Goyal and
               Jingfei Du and
               Mandar Joshi and
               Danqi Chen and
               Omer Levy and
               Mike Lewis and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
  journal   = {CoRR},
  volume    = {abs/1907.11692},
  year      = {2019},
  url       = {http://arxiv.org/abs/1907.11692},
  archivePrefix = {arXiv},
  eprint    = {1907.11692},
  timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

and the dataset used for fine-tuning:
```
@misc{ljubešić2019frenk,
      title={The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English}, 
      author={Nikola Ljubešić and Darja Fišer and Tomaž Erjavec},
      year={2019},
      eprint={1906.02045},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/1906.02045}
}
```