dsai-artur-zygadlo committed
Commit 1e22557 (1 parent: a8f17fa)

Update README.md

Files changed (1): README.md (+5 -6)
README.md CHANGED
@@ -12,8 +12,7 @@ widget:
 
 # TrelBERT
 
- TrelBERT is a BERT-based Language Model trained on data from Polish Twitter.
- It uses Masked Language Modeling objective. It is based on [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and therefore released under the same license - CC BY 4.0.
+ TrelBERT is a BERT-based language model trained on data from Polish Twitter using the Masked Language Modeling (MLM) objective. It is based on the [HerBERT](https://aclanthology.org/2021.bsnlp-1.1) model and is therefore released under the same license, CC BY 4.0.
 
 ## Training
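Since the new intro describes a masked language model, a short usage sketch may help readers of the card. This is a minimal example with the `transformers` fill-mask pipeline; the repo id `deepsense-ai/trelbert` and the `<mask>` token (inherited from HerBERT) are assumptions, so check the model page before relying on them:

```python
# Hedged sketch: querying TrelBERT through the fill-mask pipeline.
# The repo id below is an assumption, not confirmed by this commit.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="deepsense-ai/trelbert")

# HerBERT-style tokenizers use "<mask>" as the mask token (assumed here).
for prediction in fill_mask("Lubię jeździć na <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```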
 
@@ -23,7 +22,7 @@ The data we used for MLM fine-tuning was approximately 45 million Polish tweets.
 
 ### Preprocessing
 
- For each Tweet, the user handles that occur in the beginning of the text were removed, as they are not part of the message content but only represent who the user is replying to. The remaining user handles were replaced by "@anonymized_account". Links are replaced with a special @URL token.
+ For each tweet, user handles occurring at the beginning of the text were removed, as they are not part of the message content but only indicate who the user is replying to. The remaining user handles were replaced with "@anonymized_account", and links were replaced with a special @URL token.
 
 ## Tokenizer
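The preprocessing rule in the new text is concrete enough for a small illustration. The regular expressions below are assumptions about the procedure, not the authors' published code:

```python
# Hedged sketch of the tweet preprocessing described above; the exact
# patterns the authors used are not part of this commit.
import re

def preprocess_tweet(text: str) -> str:
    # Drop the reply handles at the very beginning of the tweet.
    text = re.sub(r"^(@\w+\s+)+", "", text)
    # Anonymize any handles that remain inside the message.
    text = re.sub(r"@\w+", "@anonymized_account", text)
    # Replace links with the special @URL token.
    return re.sub(r"https?://\S+", "@URL", text).strip()

print(preprocess_tweet("@user1 @user2 Zgadzam się z @user3 https://t.co/xyz"))
# -> Zgadzam się z @anonymized_account @URL
```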
 
@@ -48,11 +47,11 @@ We fine-tuned TrelBERT to [KLEJ benchmark](klejbenchmark.com) tasks and achieved
 |DYK|67.4|
 |PSC|95.7|
 |AR|86.1|
- |__avg__|__86.1__|
+ |__average__|__86.1__|
 
- For fine-tuning to KLEJ tasks we used [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use `transformers` library. In the CBD task, we set the maximum sequence length to 128 and implemented the same preprocessing procedure as in the MLM phase.
+ For fine-tuning on the KLEJ tasks we used the [Polish RoBERTa](https://github.com/sdadas/polish-roberta) scripts, which we modified to use the `transformers` library. For the CBD task, we set the maximum sequence length to 128 and applied the same preprocessing procedure as in the MLM phase.
 
- The model achieved 1st place in cyberbullying detection (CBD) task in the [KLEJ leaderboard](https://klejbenchmark.com/leaderboard). Overall, it reached 7th place, just below HerBERT model.
+ Our model achieved 1st place in the cyberbullying detection (CBD) task on the [KLEJ leaderboard](https://klejbenchmark.com/leaderboard). Overall, it reached 7th place, just below the HerBERT model.
 
 ## Authors
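The CBD setup in the new text (sequence classification, maximum length 128, MLM-style preprocessing) can be sketched as follows. This is only an illustration: the authors used modified Polish RoBERTa scripts, and the repo id and label count here are assumptions:

```python
# Hedged sketch of the CBD fine-tuning configuration described above;
# the authors' actual setup lives in their modified Polish RoBERTa scripts.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "deepsense-ai/trelbert"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
# CBD is binary (harmful vs non-harmful), hence num_labels=2 (assumed).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# The same preprocessing as in the MLM phase would be applied to the raw
# tweet before this step; truncation mirrors the maximum length of 128.
inputs = tokenizer("Zgadzam się z @anonymized_account @URL",
                   truncation=True, max_length=128, return_tensors="pt")
logits = model(**inputs).logits
```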
 
 