---
license: cc-by-4.0
pipeline_tag: fill-mask
widget:
- text: "Robert Kubica jest najlepszym <mask>."
- text: "<mask> jest największym zdrajcą <mask>."
- text: "Sztuczna inteligencja to <mask>."
- text: "Twoja <mask>."
- text: "<mask> to najlepszy polski klub."
---
# TrelBERT
TrelBERT is a BERT-based language model trained on Polish Twitter data with the masked language modeling (MLM) objective. It is based on the [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and is therefore released under the same license, CC BY 4.0.
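The model can be queried directly through the `transformers` fill-mask pipeline. A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as `deepsense-ai/trelbert` (substitute the actual repo id if it differs):

```python
# Minimal fill-mask sketch; the repo id "deepsense-ai/trelbert" is an
# assumption -- substitute the actual Hub id if it differs.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# One of the widget examples above ("Artificial intelligence is <mask>.").
for prediction in fill_mask("Sztuczna inteligencja to <mask>."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```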
## Training
We started from the [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued masked language modeling training on data collected from Twitter. The MLM fine-tuning corpus consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of 5e-5 and a batch size of 2184, using the AdamW optimizer.
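A condensed sketch of this continued-pretraining setup with the `transformers` `Trainer` is shown below. The corpus loading is a placeholder, and the split of the effective batch size of 2184 into per-device batch size and gradient accumulation steps is an assumption, as is the masking probability:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

# `tweets` stands in for the preprocessed corpus of ~45M Polish tweets.
dataset = Dataset.from_dict({"text": tweets}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="trelbert-mlm",
        num_train_epochs=1,
        learning_rate=5e-5,
        # Effective batch size 56 * 39 = 2184; this split is an assumption.
        per_device_train_batch_size=56,
        gradient_accumulation_steps=39,
    ),
    train_dataset=dataset,
    # Dynamic masking with the standard 15% probability (assumed).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```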
### Preprocessing
User handles at the beginning of a tweet are removed; any remaining handles are replaced with a special @anonymized_account token. Links are replaced with a special @URL token.
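An illustrative reconstruction of these rules in Python; the exact regular expressions used for training are an assumption:

```python
import re

HANDLE = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def preprocess(tweet: str) -> str:
    # Drop handles at the very beginning of the tweet (reply prefixes).
    tweet = re.sub(r"^(?:@\w+\s+)+", "", tweet)
    # Anonymize any handles that remain in the body.
    tweet = HANDLE.sub("@anonymized_account", tweet)
    # Replace links with the special @URL token.
    tweet = URL.sub("@URL", tweet)
    return tweet.strip()

print(preprocess("@user1 @user2 zobacz to https://t.co/abc od @user3"))
# -> "zobacz to @URL od @anonymized_account"
```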
## Tokenizer
We use the HerBERT tokenizer, extended with the special tokens introduced during preprocessing (@anonymized_account and @URL). The maximum sequence length is set to 128.
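A sketch of how the tokenizer could be extended accordingly; registering the two tokens as additional special tokens, and resizing the embedding matrix to match, are assumptions about the setup:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["@anonymized_account", "@URL"]}
)
tokenizer.model_max_length = 128

model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens
```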
## License
CC BY 4.0
## KLEJ Benchmark results
We fine-tuned TrelBERT on the [KLEJ benchmark](https://klejbenchmark.com) tasks and achieved the following results:
|task name | score |
|--|--|
|NKJP-NER|94.4|
|CDSC-E|93.9|
|CDSC-R|93.6|
|CBD|71.5|
|PolEmo2.0-IN|89.3|
|PolEmo2.0-OUT|78.1|
|DYK|67.4|
|PSC|95.7|
|AR|86.1|
For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library.
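The sketch below shows a generic `transformers`-based fine-tune on a single classification task in the same spirit; it is not the modified scripts themselves, the repo id is an assumption, and the data variables are placeholders that must be filled from the relevant KLEJ task:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("deepsense-ai/trelbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "deepsense-ai/trelbert", num_labels=num_labels  # task-dependent
)

# `train_texts` / `train_labels` stand in for one KLEJ task's training split.
train_dataset = Dataset.from_dict(
    {"text": train_texts, "label": train_labels}
).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="trelbert-klej", num_train_epochs=3),
    train_dataset=train_dataset,
)
trainer.train()
```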
## Authors
Jakub Bartczuk, Krzysztof Dziedzic, Piotr Falkiewicz, Alicja Kotyla, Wojciech Szmyd, Michał Zobniów, Artur Zygadło
For more information, reach out to us via e-mail: artur.zygadlo@deepsense.ai