Deseret 8k BPE tokenizer

A byte-level BPE tokenizer with 8,192 vocabulary trained on the FineWeb-Edu translated to Deseret corpus.

Special tokens

Token ID Purpose
<|pad|> 0 Padding
<|bos|> 1 Beginning of sequence
<|eos|> 2 End of sequence
<|user|> 3 User turn marker (SFT)
<|assistant|> 4 Assistant turn marker (SFT)
<|system|> 5 Optional system turn marker

Usage

from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="chrisjpatty/deseret-8k-bpe", filename="deseret_8k.json")
tok = Tokenizer.from_file(path)

text = "๐ข๐ฒ๐‘‰๐‘Œ๐ฎ๐‘ ๐‘„๐ฒ ๐”๐ฏ๐‘…๐ฒ๐‘‰๐ฏ๐ป ๐ˆ๐‘Š๐‘๐ฒ๐บ๐ฏ๐ป ๐ฎ๐‘† ๐‘๐ฒ๐‘Œ."
encoded = tok.encode(text)
print(encoded.ids)        # token ids
print(encoded.tokens)     # token strings (byte-level encoded)
print(tok.decode(encoded.ids))

Compression ratio

Roughly ~2.5 Deseret characters per token on prose, with common phoneme sequences like ๐‘Œ๐ฎ๐‘ (-ing), ๐‘‡๐ฒ๐‘Œ (-tion), ๐‘„๐ฒ (the) collapsed into single tokens.

Training details

  • Algorithm: byte-level BPE (tokenizers.models.BPE)
  • Pre-tokenizer: ByteLevel(add_prefix_space=False)
  • Normalizer: NFC
  • Trained on: full 125 GB Deseret-translated FineWeb-Edu corpus
  • Min frequency: 2
  • Trained with HuggingFace tokenizers library
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support