---
license: cc-by-4.0
pipeline_tag: fill-mask
widget:
- text: "Robert Kubica jest najlepszym <mask>."
- text: "<mask> jest największym zdrajcą."
- text: "Sztuczna inteligencja to <mask>."
- text: "Twoja <mask>."
- text: "<mask> to najlepszy polski klub."
---

# TrelBERT

TrelBERT is a BERT-based language model trained on Polish Twitter data using the Masked Language Modeling (MLM) objective. It is based on the [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and is therefore released under the same license, CC BY 4.0.

## Training

We trained our model starting from the [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued its Masked Language Modeling training on data collected from Twitter. The MLM fine-tuning corpus consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of 5e-5 and a batch size of 2184, using the AdamW optimizer (see the training sketch under Example code below).

### Preprocessing

User handles occurring at the beginning of a tweet are removed; all remaining handles are replaced with the special token @anonymized_account. Links are replaced with the special token @URL (see the preprocessing sketch under Example code below).

## Tokenizer

We use the HerBERT tokenizer, extended with the special tokens introduced above (@anonymized_account, @URL); a sketch is given under Example code below. The maximum sequence length is set to 128.

## License

CC BY 4.0

## KLEJ Benchmark results

We fine-tuned TrelBERT on the [KLEJ benchmark](https://klejbenchmark.com) tasks and achieved the following results:

| task | score |
|--|--|
| NKJP-NER | 94.4 |
| CDSC-E | 93.9 |
| CDSC-R | 93.6 |
| CBD | 71.5 |
| PolEmo2.0-IN | 89.3 |
| PolEmo2.0-OUT | 78.1 |
| DYK | 67.4 |
| PSC | 95.7 |
| AR | 86.1 |

For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library.

## Authors

Jakub Bartczuk, Krzysztof Dziedzic, Piotr Falkiewicz, Alicja Kotyla, Wojciech Szmyd, Michał Zobniów, Artur Zygadło

For more information, reach out to us via e-mail: artur.zygadlo@deepsense.ai
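
## Example code

The sketches below illustrate how the steps described in this card could look with the `transformers` library; they are approximations, not the exact code we used.

### Using the model

A minimal fill-mask sketch. The Hub model id `deepsense-ai/trelbert` is an assumption for illustration, as this card does not state the repository id.

```python
from transformers import pipeline

# Hypothetical Hub id; replace with the actual TrelBERT repository id.
fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# Use the tokenizer's mask token to mark the position to predict.
text = f"Robert Kubica jest najlepszym {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(text, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```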
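
### Preprocessing sketch

A rough reconstruction of the tweet preprocessing described above, assuming simple regex-based rules; the exact rules used to build the training corpus may differ in detail.

```python
import re

def preprocess(tweet: str) -> str:
    # Remove user handles that occur at the beginning of the tweet.
    tweet = re.sub(r"^(?:@\w+\s*)+", "", tweet)
    # Replace the remaining handles with the special token.
    tweet = re.sub(r"@\w+", "@anonymized_account", tweet)
    # Replace links with the special @URL token.
    tweet = re.sub(r"https?://\S+", "@URL", tweet)
    return tweet.strip()

print(preprocess("@user1 @user2 ciekawy wpis @user3 https://example.com"))
# -> "ciekawy wpis @anonymized_account @URL"
```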
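
### Extending the tokenizer

A sketch of extending the HerBERT tokenizer with the two special tokens; whether they were added exactly this way is an assumption. If new tokens are added, the model's embedding matrix must be resized accordingly (`model.resize_token_embeddings(len(tokenizer))`).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["@anonymized_account", "@URL"]}
)

encoded = tokenizer(
    "@anonymized_account to najlepszy polski klub. @URL",
    truncation=True,
    max_length=128,  # maximum sequence length used for TrelBERT
)
print(encoded["input_ids"])
```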
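
### Continued MLM training setup

A sketch of the continued MLM training described above, using the `Trainer` API with the reported hyperparameters (learning rate 5e-5, 1 epoch, AdamW, which is the `Trainer` default optimizer). The toy dataset and the device/accumulation split of the global batch size are assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

# Toy stand-in for the preprocessed tweet corpus (~45M tweets in reality).
tweet_dataset = Dataset.from_dict(
    {"text": ["przykładowy tweet @anonymized_account @URL"]}
).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="trelbert-mlm",
    learning_rate=5e-5,              # as reported above
    num_train_epochs=1,              # 1 epoch over the tweet corpus
    per_device_train_batch_size=56,  # 56 * 39 = 2184, the reported global
    gradient_accumulation_steps=39,  # batch size; this split is an assumption
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tweet_dataset,
)
trainer.train()
```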