EMBEDDIA/sloberta

matejulcar committed on Nov 24, 2021

Commit 49db152

1 Parent(s): 1d00622

added fast tokenizer

Files changed (1)

README.md CHANGED

@@ -9,10 +9,10 @@ Load in transformers library with:
 ```
 from transformers import AutoTokenizer, AutoModelForMaskedLM
-  tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=False)
   model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
 ```
-**NOTE**: it is currently *critically important* to add `use_fast=False` parameter to tokenizer if using transformers version 4+ (prior versions have `use_fast=False` as default) By default it attempts to load a fast tokenizer, which will work (ie. not result in an error), but it will not correctly map tokens to its IDs and the performance on any task will be extremely bad.
 # SloBERTa
 SloBERTa model is a monolingual Slovene BERT-like model. It is closely related to French Camembert model https://camembert-model.fr/. The corpora used for training the model have 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available on https://github.com/clarinsi/Slovene-BERT-Tool

 ```
 from transformers import AutoTokenizer, AutoModelForMaskedLM
+  tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
   model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
 ```
 # SloBERTa
 SloBERTa model is a monolingual Slovene BERT-like model. It is closely related to French Camembert model https://camembert-model.fr/. The corpora used for training the model have 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available on https://github.com/clarinsi/Slovene-BERT-Tool
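
The removed note warned that the fast tokenizer loaded without error but mapped tokens to the wrong IDs. A minimal sketch of how one might check that the slow and fast tokenizers now agree is shown below. The helper names `find_tokenizer_mismatches` and `check_sloberta` are hypothetical and not part of this repository or of transformers. The sketch assumes that `transformers` and `sentencepiece` are installed and that the model files can be downloaded.

```python
from typing import Callable, Iterable, List


def find_tokenizer_mismatches(
    encode_a: Callable[[str], List[int]],
    encode_b: Callable[[str], List[int]],
    texts: Iterable[str],
) -> List[str]:
    """Return the texts for which the two encoders produce different ID sequences."""
    return [t for t in texts if list(encode_a(t)) != list(encode_b(t))]


def check_sloberta(texts: Iterable[str]) -> List[str]:
    """Compare the slow and fast tokenizers of EMBEDDIA/sloberta.

    Hypothetical helper: it downloads the tokenizer files, so it needs
    network access. An empty result means both tokenizers produced the
    same IDs for every text that was tried.
    """
    # Imported lazily so the comparison helper above works without transformers.
    from transformers import AutoTokenizer

    slow = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=False)
    fast = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta", use_fast=True)
    return find_tokenizer_mismatches(slow.encode, fast.encode, texts)
```

Calling `check_sloberta` with a handful of Slovene sentences and getting an empty list back would suggest the two tokenizers agree on those inputs. It would not prove they agree on all inputs.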