dsai-artur-zygadlo committed on
Commit
952eb04
1 Parent(s): 088df96

Update model card

Files changed (1): README.md (+54 −0)

README.md CHANGED

---
license: cc-by-4.0
pipeline_tag: fill-mask
widget:
- text: "Robert Kubica jest najlepszym <mask>."
- text: "<mask> jest największym zdrajcą <mask>."
- text: "Sztuczna inteligencja to <mask>."
- text: "Twoja <mask>."
- text: "<mask> to najlepszy polski klub."
---

# TrelBERT

TrelBERT is a BERT-based language model trained on Polish Twitter data.
It was trained with the masked language modeling objective. It is based on the [HerBERT](https://arxiv.org/abs/2105.01735) model and is therefore released under the same license, CC BY 4.0.
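
A minimal usage sketch with the `transformers` fill-mask pipeline (the Hub model ID below is an assumption; substitute this repository's actual path):

```python
from transformers import pipeline

# Model ID is assumed here -- replace with this repository's actual Hub path.
fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# Print the top predictions for the masked token.
for prediction in fill_mask("Robert Kubica jest najlepszym <mask>."):
    print(prediction["token_str"], prediction["score"])
```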

## Training

We trained our model starting from the [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued masked language modeling training on data collected from Twitter.

The MLM fine-tuning data consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of 5e-5 and a batch size of 2184, using the AdamW optimizer.
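
The exact training code is not included here; the following is a minimal sketch of how such continued MLM training can be set up with the `transformers` `Trainer`. Only the checkpoint, epoch count, learning rate, effective batch size, and optimizer come from the description above; the masking probability, the per-device batch / accumulation split, and the toy dataset are illustrative assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

# Toy stand-in for the ~45M preprocessed tweets.
tweets = Dataset.from_dict(
    {"text": ["przykładowy tweet @anonymized_account", "ciekawy artykuł @URL"]}
)
tokenized = tweets.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking; the 15% masking probability is an assumption, not from the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="trelbert-mlm",
    num_train_epochs=1,
    learning_rate=5e-5,
    per_device_train_batch_size=24,   # illustrative split of the
    gradient_accumulation_steps=91,   # effective batch size: 24 * 91 = 2184
    optim="adamw_torch",              # AdamW, as stated above
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=tokenized).train()
```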

### Preprocessing

User handles that occur at the beginning of a tweet are removed; the remaining handles are replaced with "@anonymized_account". Links are replaced with a special @URL token.
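
A sketch of this preprocessing using simple regular expressions (the exact rules used to build the training data may differ):

```python
import re

def preprocess_tweet(text: str) -> str:
    # Drop the run of handles that opens a tweet (typical of reply threads).
    text = re.sub(r"^(?:@\w+\s+)+", "", text)
    # Anonymize any remaining handles.
    text = re.sub(r"@\w+", "@anonymized_account", text)
    # Replace links with the special @URL token.
    text = re.sub(r"https?://\S+", "@URL", text)
    return text.strip()

print(preprocess_tweet("@user1 @user2 świetny tekst @user3 https://example.com"))
# -> "świetny tekst @anonymized_account @URL"
```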

## Tokenizer

We use the HerBERT tokenizer, extended with the special tokens described above (@anonymized_account, @URL). The maximum sequence length is set to 128.
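
A sketch of how a tokenizer can be extended this way (the `additional_special_tokens` route is an assumption about how the tokens were added):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["@anonymized_account", "@URL"]}
)

# The embedding matrix must grow to cover the two new token IDs.
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")
model.resize_token_embeddings(len(tokenizer))

# The new tokens are kept whole instead of being split into subwords.
encoding = tokenizer("@URL to ciekawy artykuł", truncation=True, max_length=128)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```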

## License

CC BY 4.0

## KLEJ Benchmark results

We fine-tuned TrelBERT on the [KLEJ benchmark](https://klejbenchmark.com) tasks and achieved the following results:

| task | score |
|--|--|
| NKJP-NER | 94.4 |
| CDSC-E | 93.9 |
| CDSC-R | 93.6 |
| CBD | 71.5 |
| PolEmo2.0-IN | 89.3 |
| PolEmo2.0-OUT | 78.1 |
| DYK | 67.4 |
| PSC | 95.7 |
| AR | 86.1 |

For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library.

## Authors

Jakub Bartczuk, Krzysztof Dziedzic, Piotr Falkiewicz, Alicja Kotyla, Wojciech Szmyd, Michał Zobniów, Artur Zygadło

For more information, reach out to us via e-mail: artur.zygadlo@deepsense.ai