Deseret 8k BPE tokenizer
A byte-level BPE tokenizer with a vocabulary of 8,192 tokens, trained on the FineWeb-Edu corpus translated into the Deseret alphabet.
Special tokens
| Token | ID | Purpose |
|---|---|---|
| `<\|pad\|>` | 0 | Padding |
| `<\|bos\|>` | 1 | Beginning of sequence |
| `<\|eos\|>` | 2 | End of sequence |
| `<\|user\|>` | 3 | User turn marker (SFT) |
| `<\|assistant\|>` | 4 | Assistant turn marker (SFT) |
| `<\|system\|>` | 5 | Optional system turn marker |
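As a rough illustration of how the turn markers might be combined for SFT, the sketch below assembles one ID sequence from a list of turns. The layout (`<|bos|>`, then a role marker before each turn, then a single `<|eos|>`) is an assumption and is not specified by this card; `build_chat_ids` and its `encode` argument are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

# IDs taken from the special-token table above
BOS_ID, EOS_ID = 1, 2
ROLE_IDS = {"user": 3, "assistant": 4, "system": 5}


def build_chat_ids(
    turns: Iterable[Tuple[str, str]],
    encode: Callable[[str], List[int]],
) -> List[int]:
    """Assemble a token-id sequence for a conversation.

    Assumed layout (hypothetical, not confirmed by the model card):
    <|bos|> [role marker + turn text ids]... <|eos|>
    """
    ids = [BOS_ID]
    for role, text in turns:
        ids.append(ROLE_IDS[role])  # raises KeyError for an unknown role
        ids.extend(encode(text))
    ids.append(EOS_ID)
    return ids
```

With a loaded tokenizer `tok`, the `encode` argument could be `lambda s: tok.encode(s).ids`.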
Usage
```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Download the tokenizer definition from the Hub and load it
path = hf_hub_download(repo_id="chrisjpatty/deseret-8k-bpe", filename="deseret_8k.json")
tok = Tokenizer.from_file(path)

text = "𐐢𐐲𐑉𐑌𐐮𐑍 𐑄𐐲 𐐔𐐯𐑅𐐲𐑉𐐯𐐻 𐐈𐑊𐑁𐐲𐐺𐐯𐐻 𐐮𐑆 𐑁𐐲𐑌."
encoded = tok.encode(text)
print(encoded.ids)  # token ids
print(encoded.tokens)  # token strings (byte-level encoded)
print(tok.decode(encoded.ids))  # decoded text
```
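When several encoded sequences are batched, the padding ID from the table (0) can fill the shorter ones. The helper below is a minimal pure-Python sketch; `pad_batch` is a hypothetical name, and right-padding is an assumption rather than something this card prescribes.

```python
from typing import List, Tuple


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad id lists to a common length and build attention masks.

    pad_id defaults to 0, the <|pad|> id listed in the special-token table.
    """
    width = max((len(ids) for ids in batch), default=0)
    padded = [ids + [pad_id] * (width - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return padded, mask
```

The `tokenizers` library can also pad inside `encode_batch` once `tok.enable_padding(pad_id=0, pad_token="<|pad|>")` has been called; the manual version above only makes the mechanics explicit.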
Compression ratio
Roughly 2.5 Deseret characters per token on prose, with common phoneme sequences such as the spellings of -ing, -tion, and "the" collapsed into single tokens.
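The ratio can be checked on any sample by dividing the character count by the token count. The sketch below uses hypothetical helper names; it also reports UTF-8 bytes per token, since every letter in the Deseret Unicode block takes 4 bytes in UTF-8, so 2.5 letters per token corresponds to about 10 bytes per token on letter-only text.

```python
def chars_per_token(text: str, n_tokens: int) -> float:
    """Unicode code points per token (Python's len() counts code points)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return len(text) / n_tokens


def utf8_bytes_per_token(text: str, n_tokens: int) -> float:
    """UTF-8 bytes per token; the unit a byte-level BPE actually merges."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return len(text.encode("utf-8")) / n_tokens


# Ten Deseret letters (U+10414, U+1042F repeated), assumed to encode to 4 tokens
sample = "\U00010414\U0001042F" * 5
print(chars_per_token(sample, 4), utf8_bytes_per_token(sample, 4))
```

With a loaded tokenizer `tok`, the token count would be `len(tok.encode(text).ids)`.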
Training details
- Algorithm: byte-level BPE (`tokenizers.models.BPE`)
- Pre-tokenizer: `ByteLevel(add_prefix_space=False)`
- Normalizer: NFC
- Trained on: full 125 GB Deseret-translated FineWeb-Edu corpus
- Min frequency: 2
- Trained with the HuggingFace `tokenizers` library
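The settings listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the actual training script: the `ByteLevel` decoder, the 256-byte initial alphabet, the order in which special tokens are passed, and the function names are all assumptions.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Passed to the trainer in table order so they receive ids 0..5
SPECIAL_TOKENS = ["<|pad|>", "<|bos|>", "<|eos|>", "<|user|>", "<|assistant|>", "<|system|>"]


def build_tokenizer() -> Tokenizer:
    """Untrained tokenizer with the components named in the training details."""
    tok = Tokenizer(BPE())
    tok.normalizer = normalizers.NFC()
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tok.decoder = decoders.ByteLevel()  # assumption: needed to decode back to text
    return tok


def train_tokenizer(tok: Tokenizer, texts, vocab_size: int = 8192) -> Tokenizer:
    """Train on an iterable of strings with min frequency 2."""
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=SPECIAL_TOKENS,
        # assumption: seed all 256 byte symbols so any input can be encoded
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tok.train_from_iterator(texts, trainer=trainer)
    return tok
```

For a corpus of this size, `texts` would be a generator streaming lines from disk rather than an in-memory list.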