switched to BPE with postprocessor due to RobertaTokenizer bug
Files changed:
- README.md (+7 -7)
- added_tokens.json (deleted)
- merges.txt (deleted)
- vocab.json (deleted)

README.md (CHANGED)

@@ -12,18 +12,17 @@ tags:
 - alea
 - legal
 - financial
-- mlm
 date: '2024-11-07T00:00:00.000Z'
 ---
 
-# kl3m-004-128k-uncased
+# kl3m-004-128k-uncased tokenizer
 
-**NOTE […]
+**NOTE**: This is the same vocabulary as kl3m-004-128k-uncased, but packaged within a `RobertaProcessing` `post_processor` class
 to provide special token handling without loading a custom tokenizer class.
 
-The `kl3m-004-128k-uncased […]
+The `kl3m-004-128k-uncased` **case-insensitive** tokenizer is a domain-specific tokenizer trained on a stratified sample of nearly 4M
 documents across general, legal, and financial domains from the `kl3m-data` project, including American English,
-British English, Spanish, German, French, Italian, and other common EU languages.
+British English, Spanish, German, French, Italian, and other common EU languages.
 
 This tokenizer is being used for the next generation of KL3M embedding and generative models.
 
@@ -157,8 +156,9 @@ IDs: [536, 16356, 292, 281, 4272, 460, 628, 281, 1552, 1545, 397, 882, 309, 4378 […]
 Use the code below to get started with the model.
 
 ```
-from […]
-[…]
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-128k-uncased')
 ```
 
 ## Citation

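For context on the `RobertaProcessing` `post_processor` mentioned in the NOTE above, the sketch below shows how a post-processor of that kind is attached to a plain BPE tokenizer with the `tokenizers` library. This is a minimal illustration, not code from the repository; the `<s>`/`</s>` token names are assumptions (RoBERTa-style defaults), and the actual special tokens are whatever the repository's `tokenizer.json` defines.

```
from tokenizers import Tokenizer
from tokenizers.processors import RobertaProcessing

# Load the BPE tokenizer shipped as tokenizer.json on the Hub.
tok = Tokenizer.from_pretrained("alea-institute/kl3m-004-128k-uncased")

# Roughly what "packaged within a RobertaProcessing post_processor" means:
# the post-processor wraps every encoded sequence in CLS/SEP-style special
# tokens, so no custom tokenizer class is needed at load time.
# NOTE: "<s>" and "</s>" are assumed names; substitute the tokenizer's
# real special tokens.
tok.post_processor = RobertaProcessing(
    sep=("</s>", tok.token_to_id("</s>")),
    cls=("<s>", tok.token_to_id("<s>")),
)
```

The practical benefit is that the stock `tokenizers` loading path applies the special-token handling directly from `tokenizer.json`, which is what the NOTE means by not needing a custom tokenizer class.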
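As a quick check of the getting-started snippet added in the README diff: loading the tokenizer and encoding a sentence should show the special tokens inserted by the post-processor. The sample sentence is illustrative only.

```
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-128k-uncased')

# Encode a sample string; the post-processor should wrap the BPE tokens
# in the tokenizer's CLS/SEP-style special tokens.
encoding = tokenizer.encode("The court granted the motion.")
print(encoding.tokens)
print(encoding.ids)
```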