matejulcar committed
Commit 0d43db2 • 1 Parent(s): 2a4cff4

Update README.md
README.md CHANGED
@@ -16,4 +16,14 @@ The following corpora were used for training the model:
 * Kas 1.0
 * Janes 1.0 (only Janes-news, Janes-forum, Janes-blog, Janes-wiki subcorpora)
 * Slovenian parliamentary corpus siParl 2.0
-* slWaC
+* slWaC
+
+# Usage
+Load in the transformers library with:
+```
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=False)
+model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
+```
+**Note**: it is currently critically important to pass the `use_fast=False` parameter to the tokenizer. By default, a fast tokenizer is loaded; this will work (i.e. it will not raise an error), but it will not map tokens to their IDs correctly, and performance on any task will be extremely poor.
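
For reference, a minimal sketch of the loaded model doing masked-token prediction, assuming PyTorch is installed. The Slovenian example sentence and the top-5 decoding are illustrative, not part of the commit; the mask token is queried from the tokenizer rather than hardcoded, since the exact token string depends on the model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load exactly as in the README above; use_fast=False is required.
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")

# Build an input containing the model's mask token (illustrative sentence:
# "Ljubljana is the capital city of <mask>.").
text = f"Ljubljana je glavno mesto {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and print the top-5 candidate fills.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```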