---
license: cc-by-4.0
pipeline_tag: fill-mask
widget:
- text: "Robert Kubica jest najlepszym <mask>."
- text: "<mask> jest największym zdrajcą <mask>."
- text: "Sztuczna inteligencja to <mask>."
- text: "Twoja <mask>."
- text: "<mask> to najlepszy polski klub."
---

# TrelBERT

TrelBERT is a BERT-based language model trained on Polish Twitter data with the masked language modeling (MLM) objective.
It is based on the [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and is therefore released under the same license, CC BY 4.0.
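
A quick way to try the model is the `fill-mask` pipeline, matching the widget examples in this card. A minimal sketch; the Hub model ID below is an assumption and should be replaced with this repository's actual ID:

```python
from transformers import pipeline

# NOTE: the model ID is an assumption; use this repository's actual Hub ID.
fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# One of the widget examples from this card.
for prediction in fill_mask("Robert Kubica jest najlepszym <mask>."):
    print(f'{prediction["token_str"]:>15}  {prediction["score"]:.3f}')
```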

## Training

We trained our model starting from the [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued masked language modeling training on data collected from Twitter.

The MLM fine-tuning corpus consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of 5e-5 and a batch size of 2184, using the AdamW optimizer.
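
For readers who want to reproduce a similar setup, here is a hedged sketch of an equivalent `transformers` Trainer configuration. The corpus below is a toy stand-in (the tweet data is not public), and the masking probability and the split of the 2184 batch size across devices and gradient accumulation are assumptions:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the public HerBERT checkpoint, as described above.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

# Toy stand-in for the ~45M-tweet corpus, which is not public.
corpus = Dataset.from_dict({"text": ["Przykładowy tweet.", "Drugi tweet."]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard MLM collator; the 15% masking probability is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="trelbert-mlm",
    num_train_epochs=1,
    learning_rate=5e-5,
    # The card reports an effective batch size of 2184; the split below
    # is illustrative only (56 * 39 = 2184).
    per_device_train_batch_size=56,
    gradient_accumulation_steps=39,
)

# Trainer uses AdamW by default, matching the optimizer named above.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```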

### Preprocessing

User handles that occur at the beginning of a tweet are removed; the remaining handles are replaced with the special token "@anonymized_account". Links are replaced with the special token "@URL".
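
The exact preprocessing code is not published in this card, so the following Python sketch only illustrates the rules described above; the regular expressions are assumptions:

```python
import re

def preprocess_tweet(text: str) -> str:
    """A minimal sketch of the preprocessing described above; patterns are assumptions."""
    # Drop user handles at the beginning of the tweet (e.g. reply prefixes).
    text = re.sub(r"^(?:@\w+\s+)+", "", text)
    # Replace the remaining user handles with the special token.
    # (Handles must be replaced before URLs, so the @URL token is not re-matched.)
    text = re.sub(r"@\w+", "@anonymized_account", text)
    # Replace links with the special @URL token.
    text = re.sub(r"https?://\S+", "@URL", text)
    return text.strip()

print(preprocess_tweet("@user1 @user2 zobacz to https://t.co/abc od @user3"))
# -> "zobacz to @URL od @anonymized_account"
```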

## Tokenizer

We use the HerBERT tokenizer with the two special tokens described above (@anonymized_account, @URL) added. The maximum sequence length is set to 128.
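
A sketch of setting this up; exactly how the two extra tokens were registered for TrelBERT (special vs. regular tokens) is an assumption:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

# Register the two preprocessing tokens described above (assumed to be special tokens).
tokenizer.add_tokens(["@anonymized_account", "@URL"], special_tokens=True)
# When extending the vocabulary, the model embeddings must be resized to match:
# model.resize_token_embeddings(len(tokenizer))

encoded = tokenizer(
    "Zobacz to @URL od @anonymized_account",
    truncation=True,
    max_length=128,
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```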

## License

CC BY 4.0

## KLEJ Benchmark results

We fine-tuned TrelBERT on the [KLEJ benchmark](https://klejbenchmark.com) tasks and achieved the following results:

| task | score |
|--|--|
|NKJP-NER|94.4|
|CDSC-E|93.9|
|CDSC-R|93.6|
|CBD|71.5|
|PolEmo2.0-IN|89.3|
|PolEmo2.0-OUT|78.1|
|DYK|67.4|
|PSC|95.7|
|AR|86.1|

For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library.

## Authors

Jakub Bartczuk, Krzysztof Dziedzic, Piotr Falkiewicz, Alicja Kotyla, Wojciech Szmyd, Michał Zobniów, Artur Zygadło

For more information, reach out to us via e-mail: artur.zygadlo@deepsense.ai