dsai-artur-zygadlo committed on
Commit 6894a2f
1 Parent(s): cbbdf46

Update README.md

Files changed (1)
  1. README.md +13 -10
README.md CHANGED
@@ -2,31 +2,31 @@
  license: cc-by-4.0
  pipeline_tag: fill-mask
  widget:
+ - text: "Sztuczna inteligencja to <mask>."
  - text: "Robert Kubica jest najlepszym <mask>."
  - text: "<mask> jest największym zdrajcą <mask>."
- - text: "Sztuczna inteligencja to <mask>."
- - text: "Twoja <mask>."
  - text: "<mask> to najlepszy polski klub."
+ - text: "Twoja <mask>."
  ---

  # TrelBERT

- TrelBERT is a BERT-based Language Model trained on Polish Twitter.
- It uses Masked Language Model objective. It is based on [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and therefore released under the same license - CC BY 4.0.
+ TrelBERT is a BERT-based language model trained on data from Polish Twitter.
+ It uses the Masked Language Modeling objective. It is based on the [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and is therefore released under the same license, CC BY 4.0.

  ## Training

- We trained our model starting from [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued Masked Language Modeling training using data collected from Twitter.
+ We trained our model starting from the [`herbert-base-cased`](https://huggingface.co/allegro/herbert-base-cased) checkpoint and continued MLM training using data collected from Twitter.

- The data we used for MLM fine-tuning was approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate 5e-5 and batch size 2184 using AdamW optimizer.
+ The data we used for MLM fine-tuning consisted of approximately 45 million Polish tweets. We trained the model for 1 epoch with a learning rate of `5e-5` and a batch size of `2184` using the AdamW optimizer.

  ### Preprocessing

- The user handles that occur in the beginning of the tweet are removed. The rest is replaced by "@anonymized_account". Links are replaced with a special @URL token.
+ For each tweet, the user handles that occur at the beginning of the text were removed, as they are not part of the message content but only indicate whom the user is replying to. The remaining user handles were replaced with "@anonymized_account". Links were replaced with a special @URL token.

  ## Tokenizer

- We use HerBERT tokenizer with special tokens added as above (@anonymized_account, @URL). Maximum sequence length is set to 128.
+ We use the HerBERT tokenizer with two special tokens added for preprocessing purposes, as described above (@anonymized_account, @URL). The maximum sequence length is set to 128, based on an analysis of the Twitter data distribution.

  ## License

@@ -41,14 +41,17 @@ We fine-tuned TrelBERT to [KLEJ benchmark](klejbenchmark.com) tasks and achieved
  |NKJP-NER|94.4|
  |CDSC-E|93.9|
  |CDSC-R|93.6|
- |CBD|71.5|
+ |CBD|75.2|
  |PolEmo2.0-IN|89.3|
  |PolEmo2.0-OUT|78.1|
  |DYK|67.4|
  |PSC|95.7|
  |AR|86.1|
+ |avg|86.0|
+
+ For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library. In the CBD task, we set the maximum sequence length to 128 and implemented the same preprocessing procedure as in the MLM phase.

- For fine-tuning to KLEJ tasks we used [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use `transformers` library.
+ The model achieved 1st place in the cyberbullying detection (CBD) task on the [KLEJ leaderboard](https://klejbenchmark.com/leaderboard). Overall, it reached 7th place, just below the HerBERT model.

  ## Authors

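To make the preprocessing step concrete, here is a minimal Python sketch of the procedure the updated card describes: handles at the beginning of a tweet are dropped, remaining handles are anonymized, and links become a special token. The regular expressions and the `preprocess` helper are illustrative assumptions, not the authors' published code.

```python
import re

# Illustrative patterns only; the authors' exact matching rules are not published.
HANDLE = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def preprocess(tweet: str) -> str:
    text = tweet.strip()
    # Leading handles only mark who is being replied to, so drop them entirely.
    match = HANDLE.match(text)
    while match:
        text = text[match.end():].lstrip()
        match = HANDLE.match(text)
    # Remaining handles are part of the message content: anonymize them.
    text = HANDLE.sub("@anonymized_account", text)
    # Replace links with the dedicated special token.
    text = URL.sub("@URL", text)
    return text

# "dzięki! zobacz" = "thanks! look at"
print(preprocess("@user1 @user2 dzięki! zobacz @user3 https://example.com"))
# -> "dzięki! zobacz @anonymized_account @URL"
```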
 
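Similarly, a hedged sketch of the tokenizer setup: the two preprocessing tokens are registered as additional special tokens on top of the HerBERT tokenizer, and the embedding matrix is resized to match. Whether the authors set this up exactly this way is an assumption; only the tokens themselves, the starting checkpoint, and the 128-token limit come from the card.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Start from the checkpoint the card names as the starting point.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("allegro/herbert-base-cased")

# Register the two preprocessing tokens described in the card.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["@anonymized_account", "@URL"]}
)
# Grow the embedding matrix to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))

# The card sets the maximum sequence length to 128.
batch = tokenizer(
    "@anonymized_account pisze: zobacz @URL",
    max_length=128,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
```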
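Finally, since the card's `pipeline_tag` is `fill-mask`, the widget prompts can be reproduced locally with the `transformers` fill-mask pipeline. The Hub ID `deepsense-ai/trelbert` below is an assumption based on the committer's organization; substitute the repository's actual ID.

```python
from transformers import pipeline

# Hypothetical Hub ID; replace with the actual repository ID if it differs.
fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# First widget prompt: "Robert Kubica is the best <mask>."
for prediction in fill_mask("Robert Kubica jest najlepszym <mask>."):
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}")
```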