mjbommar committed
Commit f0c62e0
1 Parent(s): 54dcba7

switched to BPE with postprocessor due to RobertaTokenizer bug

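For context on the change, the sketch below shows what "BPE with postprocessor" typically looks like with the `tokenizers` library: a byte-level BPE model with a `RobertaProcessing` post-processor attached so special tokens are added at encode time. The file names (`vocab.json`, `merges.txt`) and special-token strings (`<s>`, `</s>`) are illustrative assumptions, not necessarily the exact values used for this tokenizer.

```
# Illustrative sketch only: build a byte-level BPE tokenizer and attach a
# RobertaProcessing post-processor so encode() wraps inputs in special tokens.
# File names and special-token strings below are assumptions; the tokens must
# exist in the vocabulary for token_to_id() to resolve them.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import RobertaProcessing

# Load an existing vocabulary and merge rules into a BPE model.
tokenizer = Tokenizer(BPE.from_file("vocab.json", "merges.txt"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Attach the post-processor that adds <s> ... </s> around each sequence,
# replacing the special-token handling previously done by RobertaTokenizer.
tokenizer.post_processor = RobertaProcessing(
    sep=("</s>", tokenizer.token_to_id("</s>")),
    cls=("<s>", tokenizer.token_to_id("<s>")),
)

print(tokenizer.encode("hello world").tokens)
```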
Files changed (4)
  1. README.md +7 -7
  2. added_tokens.json +0 -0
  3. merges.txt +0 -0
  4. vocab.json +0 -0
README.md CHANGED
@@ -12,18 +12,17 @@ tags:
  - alea
  - legal
  - financial
- - mlm
  date: '2024-11-07T00:00:00.000Z'
  ---
 
- # kl3m-004-128k-uncased-mlm tokenizer
+ # kl3m-004-128k-uncased tokenizer
 
- **NOTE:** This is the same vocabulary as `kl3m-004-128k-uncased`, but packaged within a RobertaTokenizer class
+ **NOTE**: This is the same vocabulary as kl3m-004-128k-uncased, but packaged within a `RobertaProcessing` `post_processor` class
  to provide special token handling without loading a custom tokenizer class.
 
- The `kl3m-004-128k-uncased-mlm` **case-insensitive** tokenizer is a domain-specific tokenizer trained on a stratified sample of nearly 4M
+ The `kl3m-004-128k-uncased` **case-insensitive** tokenizer is a domain-specific tokenizer trained on a stratified sample of nearly 4M
  documents across general, legal, and financial domains from the `kl3m-data` project, including American English,
- British English, Spanish, German, French, Italian, and other common EU languages.
+ British English, Spanish, German, French, Italian, and other common EU languages.
 
  This tokenizer is being used for the next generation of KL3M embedding and generative models.
 
@@ -157,8 +156,9 @@ IDs: [536, 16356, 292, 281, 4272, 460, 628, 281, 1552, 1545, 397, 882, 309, 4378
  Use the code below to get started with the model.
 
  ```
- from transformers import AutoTokenizer
- tokenizer = AutoTokenizer.from_pretrained('alea-institute/kl3m-004-128k-uncased-mlm')
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-128k-uncased')
  ```
 
  ## Citation
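As a quick check of the new packaging, a minimal usage sketch follows. It loads the published tokenizer with the `tokenizers` library, as in the updated quick-start above, and encodes a sample sentence; the repository id comes from the diff, while the expectation that the first and last tokens are the post-processor's special tokens is an assumption about this configuration.

```
# Minimal usage sketch: load the published tokenizer and confirm that the
# post-processor wraps encodings in special tokens. The exact special-token
# strings printed depend on the repository's configuration.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-128k-uncased")

encoding = tokenizer.encode("The Securities Act of 1933 applies.")
print(encoding.tokens)  # first/last entries should be the special tokens added by the post-processor
print(encoding.ids)
```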
added_tokens.json DELETED
merges.txt DELETED
vocab.json DELETED