matejulcar committed on
Commit e3529a6
1 Parent(s): 0d43db2

Update README.md

Files changed (1): README.md +9 -9
README.md CHANGED
@@ -4,6 +4,15 @@ language:
 
 license: cc-by-sa-4.0
 ---
+# Usage
+Load the model with the transformers library:
+```
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=False)
+model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
+```
+**NOTE**: it is currently *critically important* to pass the `use_fast=False` parameter to the tokenizer. By default it attempts to load a fast tokenizer, which will work (i.e. not raise an error) but will not correctly map tokens to their IDs, so performance on any task will be extremely bad.
 
 # SloBERTa
 SloBERTa is a monolingual Slovene BERT-like model, closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model contain 3.47 billion tokens in total, and the subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available at https://github.com/clarinsi/Slovene-BERT-Tool.
@@ -18,12 +27,3 @@ The following corpora were used for training the model:
 * Slovenian parliamentary corpus siParl 2.0
 * slWaC
 
-# Usage
-Load in transformers library with:
-```
-from transformers import AutoTokenizer, AutoModelForMaskedLM
-
-tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=False)
-model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
-```
-**Note**: it is currently critically important to add `use_fast=False` parameter to tokenizer. By default it attempts to load a fast tokenizer, which will work (ie. not result in an error), but it will not correctly map tokens to its IDs and the performance on any task will be extremely bad.
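The note in this diff warns that the fast tokenizer loads without error but does not map tokens to the correct IDs. To see why that silently ruins results, here is a toy sketch in plain Python (no transformers dependency; the two vocabularies and the example sentence are invented for illustration): two tokenizers that split text identically but disagree on IDs feed the model completely different inputs.

```python
# Toy illustration: two tokenizers that split text identically but assign
# different IDs to the same tokens (hypothetical vocabularies, not SloBERTa's).
slow_vocab = {"ljubljana": 0, "je": 1, "glavno": 2, "mesto": 3}
fast_vocab = {"ljubljana": 3, "je": 0, "glavno": 1, "mesto": 2}  # same tokens, shuffled IDs

def encode(vocab, text):
    """Map each whitespace-separated token to its ID under the given vocabulary."""
    return [vocab[tok] for tok in text.lower().split()]

text = "Ljubljana je glavno mesto"
print(encode(slow_vocab, text))  # [0, 1, 2, 3]
print(encode(fast_vocab, text))  # [3, 0, 1, 2] -- same tokens, wrong IDs
```

A model's embedding table is indexed by ID, so the second encoding looks up embeddings for entirely different tokens, which is why the mismatch degrades every downstream task even though nothing visibly fails.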