alea-institute
committed on
Upload README.md with huggingface_hub
README.md
CHANGED
@@ -1,17 +1,12 @@
 ---
 library_name: tokenizers
-tags:
-- kl3m
-- kl3m-001
-- alea
-- legal
-- financial
+tags: ['kl3m', 'kl3m-001', 'alea', 'legal', 'financial']
 date: 2023-12-28
 ---

 # kl3m-001-32k tokenizer

-The `kl3m-001-32k` tokenizer is a domain-specific tokenizer trained on ~500B of financial and legal text from primarily-English sources.
+The `kl3m-001-32k` tokenizer is a domain-specific tokenizer trained on ~500B tokens of financial and legal text from primarily-English sources.

 This tokenizer was used for the first generation of KL3M embedding and generative models, including
 `kl3m-170M`, `kl3m-1.7B`, `kl3m-embedding-001`, and `kl3m-embedding-002`.

@@ -33,7 +28,7 @@ Please see `kl3m-003-64k` for the next iteration of our research on domain-speci

 ### Model Description

-The `kl3m-001-32k` tokenizer is a domain-specific tokenizer trained on ~500B of financial and legal text from primarily-English sources.
+The `kl3m-001-32k` tokenizer is a domain-specific tokenizer trained on ~500B tokens of financial and legal text from primarily-English sources.

 This tokenizer is notable for a number of reasons:

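The front-matter change in this commit rewrites `tags` from a block list to a flow-style list, which should leave the parsed metadata unchanged. The sketch below checks that under stated assumptions: `parse_tags` is a hypothetical helper written for this illustration, it handles only the two layouts that appear in this diff, and it is not a general YAML parser. A real project would use a proper YAML library instead.

```python
import ast


def parse_tags(front_matter: str) -> list:
    """Extract `tags` from a tiny YAML subset (block list or flow list).

    Hypothetical helper for illustration only; not a YAML parser.
    """
    lines = front_matter.strip().splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("tags:"):
            continue
        inline = line[len("tags:"):].strip()
        if inline:
            # Flow style: this quoted list is also a valid Python literal.
            return list(ast.literal_eval(inline))
        # Block style: collect the "- item" lines that follow the key.
        tags = []
        for item in lines[i + 1:]:
            if not item.startswith("- "):
                break
            tags.append(item[2:].strip())
        return tags
    return []


# Front matter before the commit (block list).
OLD = """
library_name: tokenizers
tags:
- kl3m
- kl3m-001
- alea
- legal
- financial
date: 2023-12-28
"""

# Front matter after the commit (flow list).
NEW = """
library_name: tokenizers
tags: ['kl3m', 'kl3m-001', 'alea', 'legal', 'financial']
date: 2023-12-28
"""

old_tags = parse_tags(OLD)
new_tags = parse_tags(NEW)
print(old_tags == new_tags)
```

If the two lists compare equal, the only substantive change in the commit is the added word "tokens" in the two description sentences.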