metadata
license: cc-by-4.0
pipeline_tag: fill-mask
widget:
  - text: Robert Kubica jest najlepszym <mask>.
  - text: <mask> jest największym zdrajcą <mask>.
  - text: Sztuczna inteligencja to <mask>.
  - text: Twoja <mask>.
  - text: <mask> to najlepszy polski klub.

TrelBERT

TrelBERT is a BERT-based language model trained on Polish Twitter data with the masked language modeling (MLM) objective. It is based on the HerBERT model and is therefore released under the same license, CC BY 4.0.
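As a rough illustration of the fill-mask use case, the sketch below queries the model through the transformers `pipeline` API. The Hub identifier `deepsense-ai/trelbert` and the helper names `build_fill_mask` and `top_tokens` are assumptions made for this example, not something this README states. The helper handles inputs with a single `<mask>` only; with several masks the pipeline returns one list of candidates per mask.

```python
# Assumed Hub id; adjust it to wherever the checkpoint is actually hosted.
MODEL_ID = "deepsense-ai/trelbert"


def build_fill_mask(model_id=MODEL_ID):
    """Create a fill-mask pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below can be used without transformers.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(fill_mask, text, top_k=5):
    """Return (token, score) pairs for a text containing exactly one <mask>."""
    predictions = fill_mask(text, top_k=top_k)
    return [(p["token_str"], round(p["score"], 4)) for p in predictions]
```

Typical usage would be `top_tokens(build_fill_mask(), "Sztuczna inteligencja to <mask>.")`, which needs network access the first time the model is downloaded.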

Training

We started from the herbert-base-cased checkpoint and continued masked language modeling training on data collected from Twitter.

The data used for MLM fine-tuning consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of 5e-5 and a batch size of 2184, using the AdamW optimizer.
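To give a sense of scale, the figures above translate into roughly twenty thousand optimizer updates. The sketch below is back-of-the-envelope arithmetic only: it assumes one tweet is one training example and that 2184 is the effective batch size per update, neither of which is spelled out here. The function name `optimizer_steps` is hypothetical.

```python
import math

NUM_TWEETS = 45_000_000  # approximate corpus size quoted above
BATCH_SIZE = 2184        # assumed to be the effective batch size per update
EPOCHS = 1


def optimizer_steps(num_examples, batch_size, epochs=1):
    """Number of optimizer updates, counting the last partial batch of each epoch."""
    return math.ceil(num_examples / batch_size) * epochs


steps = optimizer_steps(NUM_TWEETS, BATCH_SIZE, EPOCHS)
print(steps)  # 20605
```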

Preprocessing

User handles that occur at the beginning of a tweet are removed. All other user handles are replaced with "@anonymized_account". Links are replaced with a special @URL token.
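The rules above can be approximated with a few regular expressions. This is a minimal sketch, not the code used to build the training data: the patterns for what counts as a handle or a link are assumptions, and `preprocess_tweet` is a hypothetical name.

```python
import re

USER_TOKEN = "@anonymized_account"
URL_TOKEN = "@URL"

# Assumed patterns; the exact rules used for the training data are not published here.
LEADING_HANDLES = re.compile(r"^(?:\s*@\w+)+\s*")
# The lookbehind skips "@" inside e-mail addresses and URL paths.
HANDLE = re.compile(r"(?<![\w/.])@\w+")
URL = re.compile(r"https?://\S+")


def preprocess_tweet(text: str) -> str:
    """Drop leading handles, anonymize the remaining ones, and mask links."""
    text = LEADING_HANDLES.sub("", text)
    # Handles are replaced before links so that the inserted @URL token
    # is never mistaken for a user handle.
    text = HANDLE.sub(USER_TOKEN, text)
    text = URL.sub(URL_TOKEN, text)
    return text.strip()


print(preprocess_tweet("@alice @bob Zgadzam się z @carol https://example.com/x"))
# Zgadzam się z @anonymized_account @URL
```

One simplification to be aware of: the link pattern consumes everything up to the next whitespace, so punctuation glued to the end of a link is swallowed with it.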

Tokenizer

We use the HerBERT tokenizer, extended with the special tokens described above (@anonymized_account, @URL). The maximum sequence length is set to 128.
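A comparable tokenizer setup could look like the sketch below, built on the transformers `AutoTokenizer` API. The Hub identifier `allegro/herbert-base-cased` and the helper names `load_tokenizer` and `encode` are assumptions for this example; the released TrelBERT tokenizer already ships with the extra tokens, so this step matters mainly when reproducing the setup from the base checkpoint.

```python
SPECIAL_TOKENS = ["@anonymized_account", "@URL"]
MAX_LENGTH = 128
# Assumed Hub id for the herbert-base-cased checkpoint.
BASE_CHECKPOINT = "allegro/herbert-base-cased"


def load_tokenizer(checkpoint=BASE_CHECKPOINT):
    """Load the base tokenizer and register the two placeholder tokens."""
    # Imported lazily so encode() can be used without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Registered as special tokens so they are never split into subwords.
    tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS})
    return tokenizer


def encode(tokenizer, texts):
    """Tokenize texts, truncating each one to MAX_LENGTH tokens."""
    return tokenizer(texts, truncation=True, max_length=MAX_LENGTH)
```

When new tokens are added before continued training, the model's embedding matrix has to grow to match, for example with `model.resize_token_embeddings(len(tokenizer))`.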

License

CC BY 4.0

KLEJ Benchmark results

We fine-tuned TrelBERT on the KLEJ benchmark tasks and achieved the following results:

| Task | Score |
|---|---|
| NKJP-NER | 94.4 |
| CDSC-E | 93.9 |
| CDSC-R | 93.6 |
| CBD | 71.5 |
| PolEmo2.0-IN | 89.3 |
| PolEmo2.0-OUT | 78.1 |
| DYK | 67.4 |
| PSC | 95.7 |
| AR | 86.1 |

For fine-tuning on the KLEJ tasks we used the Polish RoBERTa scripts, which we modified to use the transformers library.

Authors

Jakub Bartczuk, Krzysztof Dziedzic, Piotr Falkiewicz, Alicja Kotyla, Wojciech Szmyd, Michał Zobniów, Artur Zygadło

For more information, reach out to us via e-mail: artur.zygadlo@deepsense.ai