---
language: sk
tags:
- SlovakBERT
license: mit
datasets:
- wikipedia
- opensubtitles
- oscar
- gerulatawebcrawl
- gerulatamonitoring
- blbec.online
---

# SlovakBERT (base-sized model)
SlovakBERT is a model pretrained on the Slovak language with a masked language modeling (MLM) objective. The model is case-sensitive: it distinguishes between slovensko and Slovensko.

## Intended uses & limitations
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
**IMPORTANT**: The model was not trained on the “ and ” (curly quote) characters, so before tokenizing your text, replace every “ and ” with a plain " (straight double quote), for example as sketched below.

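A minimal sketch of that normalization (plain string replacement is assumed to be enough for your input):

```python
# Map the curly quotes the model never saw during training to the straight
# double quote it was trained on, before any tokenization happens.
def normalize_quotes(text: str) -> str:
    return text.replace('“', '"').replace('”', '"')

print(normalize_quotes('Povedal: “Ahoj.”'))  # Povedal: "Ahoj."
```
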
### How to use
You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Deti sa <mask> na ihrisku.")

[{'sequence': 'Deti sa hrali na ihrisku.',
  'score': 0.6355380415916443,
  'token': 5949,
  'token_str': ' hrali'},
 {'sequence': 'Deti sa hrajú na ihrisku.',
  'score': 0.14731724560260773,
  'token': 9081,
  'token_str': ' hrajú'},
 {'sequence': 'Deti sa zahrali na ihrisku.',
  'score': 0.05016357824206352,
  'token': 32553,
  'token_str': ' zahrali'},
 {'sequence': 'Deti sa stretli na ihrisku.',
  'score': 0.041727423667907715,
  'token': 5964,
  'token_str': ' stretli'},
 {'sequence': 'Deti sa učia na ihrisku.',
  'score': 0.01886524073779583,
  'token': 18099,
  'token_str': ' učia'}]
```

Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = RobertaModel.from_pretrained('gerulata/slovakbert')
text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
model = TFRobertaModel.from_pretrained('gerulata/slovakbert')
text = "Text ktorý sa má embedovať."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
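The token-level features live in `output.last_hidden_state`. As a usage sketch, mean pooling over tokens is one common way to get a single sentence vector; it is a choice of ours, not something this model card prescribes:

```python
# Continues the PyTorch example above: output.last_hidden_state has shape
# [batch, seq_len, hidden], with hidden = 768 for this base-sized model.
# Averaging over the sequence dimension gives a simple sentence embedding.
sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```
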
Or extract information from the model like this:
```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
unmasker("Slovenské národne povstanie sa uskutočnilo v roku <mask>.")

[{'sequence': 'Slovenské národne povstanie sa uskutočnilo v roku 1944.',
  'score': 0.7383289933204651,
  'token': 16621,
  'token_str': ' 1944'},...]
```

## Training data
The SlovakBERT model was pretrained on these datasets:

- Wikipedia (326MB of text),
- OpenSubtitles (415MB of text),
- Oscar (4.6GB of text),
- Gerulata WebCrawl (12.7GB of text),
- Gerulata Monitoring (214MB of text),
- blbec.online (4.5GB of text).

The text was then processed with the following steps; a rough sketch in code follows the list:
- URL and email addresses were replaced with special tokens ("url", "email").
- Elongated punctuation was reduced (e.g. -- to -).
- Markdown syntax was deleted.
- All text content in braces {} was eliminated to reduce the amount of markup and programming-language text.

We segmented the resulting corpus into sentences and removed duplicates to get 181.6M unique sentences. In total, the final corpus has 19.35GB of text.

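A hedged sketch of what these cleaning steps could look like; the exact rules and regular expressions used for SlovakBERT are not published in this card, so every pattern below is an illustrative assumption (Markdown stripping is omitted):

```python
import re

def clean_line(text: str) -> str:
    text = re.sub(r'https?://\S+|www\.\S+', 'url', text)   # URLs -> special token
    text = re.sub(r'\S+@\S+\.\S+', 'email', text)          # e-mail addresses -> special token
    text = re.sub(r'([.,!?:;-])\1+', r'\1', text)          # collapse elongated punctuation (-- to -)
    text = re.sub(r'\{[^{}]*\}', '', text)                 # drop content in braces
    return text

print(clean_line("Wow!! Pozri https://example.sk -- alebo {kód} napíš na info@example.sk"))
# Wow! Pozri url - alebo  napíš na email
```
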
## Pretraining
The model was trained in **fairseq** on 4 x Nvidia A100 GPUs for 300K steps with a batch size of 512 and a sequence length of 512. The optimizer was Adam with a learning rate of 5e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), a weight decay of 0.01, a dropout rate of 0.1, learning rate warmup for 10k steps and linear decay of the learning rate afterwards. We used 16-bit float precision.

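To make the schedule concrete, here is a small sketch of the stated warmup-then-linear-decay learning rate; decaying to exactly zero at step 300K is our assumption, since the card only says the rate decays linearly after warmup:

```python
def lr_at(step: int, peak: float = 5e-4, warmup: int = 10_000, total: int = 300_000) -> float:
    """Linear warmup to the peak LR, then linear decay (assumed to reach 0 at `total`)."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)

print(lr_at(5_000), lr_at(10_000), lr_at(300_000))  # 0.00025 0.0005 0.0
```
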
## About us
<a href="https://www.gerulata.com/">
	<img width="300px" src="https://www.gerulata.com/assets/images/Logo_Footer.svg">
</a>

Gerulata uses near real-time monitoring, advanced analytics and machine learning to help create a safer, more productive and enjoyable online environment for everyone.

## BibTeX entry and citation info
If you find our resource or paper useful, please consider including the following citation in your paper.
- https://arxiv.org/abs/2109.15254

```bibtex
@misc{pikuliak2021slovakbert,
      title={SlovakBERT: Slovak Masked Language Model},
      author={Matúš Pikuliak and Štefan Grivalský and Martin Konôpka and Miroslav Blšták and Martin Tamajka and Viktor Bachratý and Marián Šimko and Pavol Balážik and Michal Trnka and Filip Uhlárik},
      year={2021},
      eprint={2109.15254},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```