---
language: pl
license: cc-by-4.0
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
- text: "Sztuczna inteligencja to <mask>."
- text: "Robert Kubica jest najlepszym <mask>."
- text: "<mask> jest największym zdrajcą."
- text: "<mask> to najlepszy polski klub."
- text: "Twoja <mask>"
---

# TrelBERT

TrelBERT is a BERT-based language model trained on data from Polish Twitter using the Masked Language Modeling objective. It is based on the [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and is therefore released under the same license, CC BY 4.0.
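
As a fill-mask model, TrelBERT can be queried directly through the `transformers` pipeline. A minimal sketch, assuming the model is published under the `deepsense-ai/trelbert` Hub identifier (adjust to the actual repository name):

```python
from transformers import pipeline

# "deepsense-ai/trelbert" is an assumed Hub identifier -- adjust to the actual repository name.
fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# The mask token is "<mask>", as in the widget examples above.
for prediction in fill_mask("Sztuczna inteligencja to <mask>."):
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}")
```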

## Training

We trained our model starting from the [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued MLM training using data collected from Twitter.

The data we used for MLM fine-tuning consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of `5e-5` and a batch size of `2184`, using the AdamW optimizer.
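
A minimal sketch of what this continued MLM training could look like with the `transformers` `Trainer`. Only the hyperparameters reported above (1 epoch, learning rate 5e-5, batch size 2184, AdamW) come from this card; the tiny placeholder dataset, the 15% masking rate, and the per-device / gradient-accumulation split are illustrative assumptions:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

# Placeholder corpus -- in practice this would be the ~45M preprocessed tweets.
tweets = Dataset.from_dict({"text": ["@anonymized_account dobry mecz!", "Nowy wpis: @URL"]})
tokenized = tweets.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Reported hyperparameters: 1 epoch, learning rate 5e-5, batch size 2184, AdamW.
# The per-device / gradient-accumulation split below is only an illustrative guess.
args = TrainingArguments(
    output_dir="trelbert-mlm",
    num_train_epochs=1,
    learning_rate=5e-5,
    per_device_train_batch_size=24,
    gradient_accumulation_steps=91,  # 24 * 91 = 2184 effective batch size
    optim="adamw_torch",
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)  # standard 15% masking

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```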

### Preprocessing

For each tweet, the user handles that occur at the beginning of the text were removed, as they are not part of the message content but only indicate who the user is replying to. The remaining user handles were replaced with "@anonymized_account", and links were replaced with a special @URL token.
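
A minimal sketch of this preprocessing, assuming simple regular expressions for handles and links (the exact patterns used for TrelBERT are not specified here):

```python
import re

# Illustrative patterns -- the exact handle/link definitions used for TrelBERT are an assumption here.
HANDLE = re.compile(r"@\w+")
LINK = re.compile(r"https?://\S+")

def preprocess_tweet(text: str) -> str:
    text = text.strip()
    # Drop the handles that open the tweet: they only indicate who is being replied to.
    while (match := HANDLE.match(text)):
        text = text[match.end():].lstrip()
    # Anonymize the remaining handles and replace links with the special token.
    text = HANDLE.sub("@anonymized_account", text)
    return LINK.sub("@URL", text)

print(preprocess_tweet("@user1 @user2 świetny tekst od @user3: https://example.com"))
# -> świetny tekst od @anonymized_account: @URL
```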

## Tokenizer

We use the HerBERT tokenizer with two special tokens added for the preprocessing purposes described above (@anonymized_account, @URL). The maximum sequence length is set to 128, based on an analysis of the Twitter data distribution.
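
A sketch of how the two tokens might be registered on top of the HerBERT tokenizer; the exact way they were added to TrelBERT's vocabulary is an assumption here, and during training the model's embedding matrix would need to be resized accordingly (`model.resize_token_embeddings(len(tokenizer))`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

# Register the two preprocessing tokens so they are never split into subwords.
tokenizer.add_special_tokens({"additional_special_tokens": ["@anonymized_account", "@URL"]})

encoded = tokenizer(
    "@anonymized_account polecam ten wpis: @URL",
    truncation=True,
    max_length=128,  # maximum sequence length chosen for Twitter data
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```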

## License

CC BY 4.0

## KLEJ Benchmark results

We fine-tuned TrelBERT on the [KLEJ benchmark](https://klejbenchmark.com) tasks and achieved the following results:

<style>
    tr:last-child {
        border-top-width: 4px;
    }
</style>
|Task name|Score| 
|--|--|
|NKJP-NER|94.4|
|CDSC-E|93.9|
|CDSC-R|93.6|
|CBD|76.1|
|PolEmo2.0-IN|89.3|
|PolEmo2.0-OUT|78.1|
|DYK|67.4|
|PSC|95.7|
|AR|86.1|
|__Average__|__86.1__|

For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library. For the CBD task, we set the maximum sequence length to 128 and applied the same preprocessing procedure as in the MLM phase.
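
For illustration only, a hedged sketch of loading TrelBERT with a classification head for CBD-style fine-tuning; the `deepsense-ai/trelbert` identifier and the two-label setup are assumptions, and the actual experiments used the modified Polish RoBERTa scripts mentioned above:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "deepsense-ai/trelbert" is an assumed Hub identifier; CBD is treated here as binary classification.
tokenizer = AutoTokenizer.from_pretrained("deepsense-ai/trelbert")
model = AutoModelForSequenceClassification.from_pretrained("deepsense-ai/trelbert", num_labels=2)

# Inputs get the same preprocessing as in the MLM phase and are truncated to 128 tokens.
text = "@anonymized_account co za żenujący komentarz @URL"
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2); fine-tuning on the CBD training set follows from here
```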

Our model achieved 1st place in the cyberbullying detection (CBD) task on the [KLEJ leaderboard](https://klejbenchmark.com/leaderboard). Overall, it reached 7th place, just below the HerBERT model.

## Citation
Please cite the following paper:
```
@inproceedings{szmyd-etal-2023-trelbert,
    title = "{T}rel{BERT}: A pre-trained encoder for {P}olish {T}witter",
    author = "Szmyd, Wojciech  and
      Kotyla, Alicja  and
      Zobni{\'o}w, Micha{\l}  and
      Falkiewicz, Piotr  and
      Bartczuk, Jakub  and
      Zygad{\l}o, Artur",
    booktitle = "Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.bsnlp-1.3",
    pages = "17--24",
    abstract = "Pre-trained Transformer-based models have become immensely popular amongst NLP practitioners. We present TrelBERT {--} the first Polish language model suited for application in the social media domain. TrelBERT is based on an existing general-domain model and adapted to the language of social media by pre-training it further on a large collection of Twitter data. We demonstrate its usefulness by evaluating it in the downstream task of cyberbullying detection, in which it achieves state-of-the-art results, outperforming larger monolingual models trained on general-domain corpora, as well as multilingual in-domain models, by a large margin. We make the model publicly available. We also release a new dataset for the problem of harmful speech detection.",
}

```

## Authors

Jakub Bartczuk, Krzysztof Dziedzic, Piotr Falkiewicz, Alicja Kotyla, Wojciech Szmyd, Michał Zobniów, Artur Zygadło

For more information, reach out to us via e-mail: artur.zygadlo@deepsense.ai