---
license: cc-by-4.0
pipeline_tag: fill-mask
widget:
- text: "Robert Kubica jest najlepszym <mask>."
- text: "<mask> jest największym zdrajcą <mask>."
- text: "Sztuczna inteligencja to <mask>."
- text: "Twoja <mask>."
- text: "<mask> to najlepszy polski klub."
---

# TrelBERT

TrelBERT is a BERT-based language model trained on Polish Twitter data. It was trained with the Masked Language Modeling objective. It is based on the [HerBERT](https://arxiv.org/abs/2105.01735) model and is therefore released under the same license, CC BY 4.0.
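
For illustration, a minimal sketch of querying the model through the `transformers` fill-mask pipeline; the repository id `deepsense-ai/trelbert` used below is an assumption and should be replaced with this model's actual id.

```python
# Minimal sketch: masked-token prediction with the fill-mask pipeline.
# The model id "deepsense-ai/trelbert" is an assumption, not confirmed here.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# Print the top predicted tokens for the masked position with their scores.
for prediction in fill_mask("Sztuczna inteligencja to <mask>."):
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}")
```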

## Training

We trained our model starting from the [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued Masked Language Modeling training on data collected from Twitter.

The data used for MLM fine-tuning consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of 5e-5 and a batch size of 2184, using the AdamW optimizer.
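
For illustration, a minimal sketch of such a continued-pretraining setup with the `transformers` `Trainer`, mirroring the hyperparameters above; the data file, the per-device batch size, and the gradient-accumulation split (104 x 21 = 2184) are assumptions, not the exact training code.

```python
# Sketch of continued MLM training from herbert-base-cased, mirroring the
# hyperparameters above (1 epoch, lr 5e-5, effective batch size 2184, AdamW).
# "tweets.txt" is a hypothetical file of preprocessed tweets, one per line.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

# Tokenize the raw tweet text to at most 128 tokens (see Tokenizer below).
dataset = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="trelbert-mlm",
        num_train_epochs=1,
        learning_rate=5e-5,                # AdamW is the Trainer default
        per_device_train_batch_size=104,   # illustrative split:
        gradient_accumulation_steps=21,    # 104 * 21 = 2184 effective
    ),
    train_dataset=dataset,
    # Random 15% token masking, the standard MLM setting.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```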

### Preprocessing

User handles that occur at the beginning of a tweet are removed; the remaining handles are replaced with a special @anonymized_account token. Links are replaced with a special @URL token.
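
For illustration, a minimal sketch reconstructing these rules; this is an approximation, not the exact preprocessing script.

```python
import re

# Approximate reconstruction of the preprocessing rules above: leading handles
# are dropped, remaining handles become @anonymized_account, links become @URL.
HANDLE = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def preprocess(tweet: str) -> str:
    tweet = tweet.strip()
    # Drop handles at the very beginning of the tweet.
    while (m := HANDLE.match(tweet)):
        tweet = tweet[m.end():].lstrip()
    tweet = HANDLE.sub("@anonymized_account", tweet)  # anonymize the rest
    return URL.sub("@URL", tweet)                     # mask links

print(preprocess("@user1 @user2 świetny artykuł @user3 https://example.com"))
# -> "świetny artykuł @anonymized_account @URL"
```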

## Tokenizer

We use the HerBERT tokenizer with the two special tokens described above (@anonymized_account and @URL) added. The maximum sequence length is set to 128.
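
For illustration, a minimal sketch of how such special tokens can be added to the base HerBERT tokenizer; with the released TrelBERT tokenizer this step is already done, and the embedding resize is only needed when the vocabulary actually grows.

```python
# Sketch: adding the two special tokens to the base HerBERT tokenizer.
# The released TrelBERT tokenizer already contains them; this is illustrative.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["@anonymized_account", "@URL"]}
)

model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")
model.resize_token_embeddings(len(tokenizer))  # grow embeddings to match

encoding = tokenizer(
    "@anonymized_account Sztuczna inteligencja to przyszłość @URL",
    truncation=True,
    max_length=128,
)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```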

## License

CC BY 4.0

## KLEJ Benchmark results

We fine-tuned TrelBERT on the [KLEJ benchmark](https://klejbenchmark.com) tasks and achieved the following results:

| task | score |
|--|--|
| NKJP-NER | 94.4 |
| CDSC-E | 93.9 |
| CDSC-R | 93.6 |
| CBD | 71.5 |
| PolEmo2.0-IN | 89.3 |
| PolEmo2.0-OUT | 78.1 |
| DYK | 67.4 |
| PSC | 95.7 |
| AR | 86.1 |

For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library.
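
For illustration only, a minimal `transformers`-based sketch of fine-tuning the model on a KLEJ-style sentence-classification task; this is not the modified Polish RoBERTa code, and the model id, data files, and hyperparameters below are assumptions.

```python
# Illustrative sketch (not the modified Polish RoBERTa scripts): fine-tuning
# TrelBERT on a KLEJ-style classification task. The model id "deepsense-ai/
# trelbert", the "train.csv" file, and all hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "deepsense-ai/trelbert"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# "train.csv" is a hypothetical file with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="trelbert-klej", num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```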

## Authors

Jakub Bartczuk, Krzysztof Dziedzic, Piotr Falkiewicz, Alicja Kotyla, Wojciech Szmyd, Michał Zobniów, Artur Zygadło

For more information, reach out to us via e-mail: artur.zygadlo@deepsense.ai