Deseret 8k BPE tokenizer

A byte-level BPE tokenizer with 8,192 vocabulary trained on the FineWeb-Edu translated to Deseret corpus.

Special tokens

Token	ID	Purpose
`<\|pad\|>`	0	Padding
`<\|bos\|>`	1	Beginning of sequence
`<\|eos\|>`	2	End of sequence
`<\|user\|>`	3	User turn marker (SFT)
`<\|assistant\|>`	4	Assistant turn marker (SFT)
`<\|system\|>`	5	Optional system turn marker

Usage

from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="chrisjpatty/deseret-8k-bpe", filename="deseret_8k.json")
tok = Tokenizer.from_file(path)

text = "𐐢𐐲𐑉𐑌𐐮𐑍 𐑄𐐲 𐐔𐐯𐑅𐐲𐑉𐐯𐐻 𐐈𐑊𐑁𐐲𐐺𐐯𐐻 𐐮𐑆 𐑁𐐲𐑌."
encoded = tok.encode(text)
print(encoded.ids)        # token ids
print(encoded.tokens)     # token strings (byte-level encoded)
print(tok.decode(encoded.ids))

Compression ratio

Roughly ~2.5 Deseret characters per token on prose, with common phoneme sequences like 𐑌𐐮𐑍 (-ing), 𐑇𐐲𐑌 (-tion), 𐑄𐐲 (the) collapsed into single tokens.

Training details

Algorithm: byte-level BPE (tokenizers.models.BPE)
Pre-tokenizer: ByteLevel(add_prefix_space=False)
Normalizer: NFC
Trained on: full 125 GB Deseret-translated FineWeb-Edu corpus
Min frequency: 2
Trained with HuggingFace tokenizers library

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support