mjbommar committed
Commit f0c62e0
1 Parent(s): 54dcba7

switched to BPE with postprocessor due to RobertaTokenizer bug

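For context on the change, the sketch below shows what "BPE with postprocessor" typically looks like with the `tokenizers` library: a byte-level BPE model with a `RobertaProcessing` post-processor attached so special tokens are added at encode time. The file names (`vocab.json`, `merges.txt`) and special-token strings (`<s>`, `</s>`) are illustrative assumptions, not necessarily the exact values used for this tokenizer.

```
# Illustrative sketch only: build a byte-level BPE tokenizer and attach a
# RobertaProcessing post-processor so encode() wraps inputs in special tokens.
# File names and special-token strings below are assumptions; the tokens must
# exist in the vocabulary for token_to_id() to resolve them.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import RobertaProcessing

# Load an existing vocabulary and merge rules into a BPE model.
tokenizer = Tokenizer(BPE.from_file("vocab.json", "merges.txt"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Attach the post-processor that adds <s> ... </s> around each sequence,
# replacing the special-token handling previously done by RobertaTokenizer.
tokenizer.post_processor = RobertaProcessing(
    sep=("</s>", tokenizer.token_to_id("</s>")),
    cls=("<s>", tokenizer.token_to_id("<s>")),
)

print(tokenizer.encode("hello world").tokens)
```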
Files changed (4)
  1. README.md +7 -7
  2. added_tokens.json +0 -0
  3. merges.txt +0 -0
  4. vocab.json +0 -0
README.md CHANGED
@@ -12,18 +12,17 @@ tags:
  - alea
  - legal
  - financial
- - mlm
  date: '2024-11-07T00:00:00.000Z'
  ---
 
- # kl3m-004-128k-uncased-mlm tokenizer
+ # kl3m-004-128k-uncased tokenizer
 
- **NOTE:** This is the same vocabulary as `kl3m-004-128k-uncased`, but packaged within a RobertaTokenizer class
+ **NOTE**: This is the same vocabulary as kl3m-004-128k-uncased, but packaged within a `RobertaProcessing` `post_processor` class
  to provide special token handling without loading a custom tokenizer class.
 
- The `kl3m-004-128k-uncased-mlm` **case-insensitive** tokenizer is a domain-specific tokenizer trained on a stratified sample of nearly 4M
+ The `kl3m-004-128k-uncased` **case-insensitive** tokenizer is a domain-specific tokenizer trained on a stratified sample of nearly 4M
  documents across general, legal, and financial domains from the `kl3m-data` project, including American English,
- British English, Spanish, German, French, Italian, and other common EU languages.
+ British English, Spanish, German, French, Italian, and other common EU languages.
 
  This tokenizer is being used for the next generation of KL3M embedding and generative models.
 
@@ -157,8 +156,9 @@ IDs: [536, 16356, 292, 281, 4272, 460, 628, 281, 1552, 1545, 397, 882, 309, 4378
  Use the code below to get started with the model.
 
  ```
- from transformers import AutoTokenizer
- tokenizer = AutoTokenizer.from_pretrained('alea-institute/kl3m-004-128k-uncased-mlm')
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-128k-uncased')
  ```
 
  ## Citation
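As a quick check of the new packaging, a minimal usage sketch follows. It loads the published tokenizer with the `tokenizers` library, as in the updated quick-start above, and encodes a sample sentence; the repository id comes from the diff, while the expectation that the first and last tokens are the post-processor's special tokens is an assumption about this configuration.

```
# Minimal usage sketch: load the published tokenizer and confirm that the
# post-processor wraps encodings in special tokens. The exact special-token
# strings printed depend on the repository's configuration.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-128k-uncased")

encoding = tokenizer.encode("The Securities Act of 1933 applies.")
print(encoding.tokens)  # first/last entries should be the special tokens added by the post-processor
print(encoding.ids)
```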
added_tokens.json DELETED
merges.txt DELETED
vocab.json DELETED