Til-Tokenizer-128k
Morpheme-aware, multilingual, agentic byte-level BPE tokenizer for the Til Kazakh foundation-model program. 128 000 vocab. Kazakh-first, with Russian, English, code, math, and agent/tool-call structure.
Highlights
- Never emits
<unk>— byte-level: every byte is representable, so any unicode, emoji, CJK, or rare symbol round-trips.decode(encode(x)) == x(lossless) on every domain. - Single-digit numbers —
12345 → 1 2 3 4 5,3.14159 → 3 . 1 4 1 5 9. Gives the model clean place-value alignment for arithmetic. - Kazakh morpheme awareness — optional segmentation so no token crosses a morpheme boundary.
Boundary recall 0.896 / precision 0.85 / F1 0.87 vs gold (
kaz-morphology-sample). - Agentic / structural — JSON, tool-call, markdown, diff, and code are tokenized cheaply (learned from real agent data), plus dedicated chat / FIM / reasoning / tool-call control tokens.
- Code-friendly — indentation runs (4 sp / 8 sp / tab) and operators (
== != -> => ** :: && ||) are single tokens. - 384 reserved slots — add future special tokens without resizing the embedding.
Design / scheme
Base: byte-level BPE (HF tokenizers). The ByteLevel alphabet covers all 256 bytes, so the
tokenizer is lossless and can never UNK.
Pre-tokenizer pipeline (in order):
Split(\x1F)— morpheme-boundary marker (fires only when the input was morpheme-segmented; see below).Digits(individual_digits=True)— every digit becomes its own token.ByteLevel(add_prefix_space=False)— preserves whitespace / indentation (code-friendly).
Vocabulary (131 072):
- ~430 special / control tokens (kept whole):
- chat / turn:
<|im_start|> <|im_end|>+<|system|> <|user|> <|assistant|> <|tool|> <|tool_result|> - reasoning:
<thinking> </thinking> - tool-call control:
<|tool_call|> <|/tool_call|> <|arg_sep|> <|python_tag|>+<tool_call>… - FIM:
<|fim_prefix|> <|fim_suffix|> <|fim_middle|> <|fim_pad|> <|file_sep|> <|repo_name|> - 384 reserved
<|reserved_0..383|>
- chat / turn:
- ~130 600 BPE merges learned from the training mix.
- JSON / markdown / diff / path units are learned by BPE as ordinary merges (not added as special tokens) so they remain content and survive decoding — keeping the tokenizer lossless.
Kazakh morpheme mode
A neural/FST morpheme segmenter can't be serialized inside tokenizer.json, so Kazakh morpheme
awareness is two-stage:
- Segment Kazakh text with the bundled qazcorpora BiLSTM segmenter
(
morpheme_segmenter.py+qazcorpora_model.py+morpho_lemma_suf.pth, trained on QazCorpora BIO tags). It inserts\x1Fbetween morphemes. - Tokenize — the
Split(\x1F)step guarantees no BPE token crosses a morpheme boundary.
from morpheme_segmenter import MorphemeSegmenter
from tokenizers import Tokenizer
seg = MorphemeSegmenter(backend="qazcorpora", qazcorpora_model="morpho_lemma_suf.pth")
tok = Tokenizer.from_file("tokenizer.json")
# Kazakh — segment first (morpheme-aware):
ids = tok.encode(seg.segment("Мектептегі оқушылар математиканы оқыды")).ids
# Other languages / code — tokenize directly:
ids = tok.encode('{"name": "get_weather", "arguments": {"city": "Astana"}}').ids
Training data
Fitted on a ~39 GB balanced sample. The mix is tuned for fertility: Kazakh is agglutinative and needs more merges, while English is frequent and needs a smaller share.
| domain | share | source |
|---|---|---|
| Kazakh | 30% | HPLT2 · mC4 · CulturaX · GlotCC · finepdfs · KazParC · MADLAD-400 v1.5 · Leipzig |
| English | 20% | FineWeb-Edu |
| Russian | 16% | FineWeb-2 (rus_Cyrl) |
| Code | 20% | github-code-clean (permissive licenses) |
| Math | 6% | OpenWebMath · FineMath |
| Agentic | 8% | glaive-function-calling-v2 · hermes-function-calling-v1 · OpenThoughts-114k · dolphin-r1 · commitpackft · cosmopedia |
The agentic slice is real data (tool-call JSON, <think> reasoning, code diffs, markdown), so the
tokenizer learns genuine agent/structural patterns rather than synthetic ones.
Metrics
Fertility = characters per token on real prose (higher = better):
| domain | chars/tok |
|---|---|
| Kazakh | 4.4 (≈ 2.3 tokens/word; 8–9 on common phrases) |
| English | 4.52 |
| Russian | 4.33 |
| Code | 3.63 |
| Math | 3.76 |
| check | result |
|---|---|
| Kazakh morpheme boundary recall | 0.896 |
| lossless roundtrip | ✅ all domains |
| single-digit numbers | ✅ |
| never-UNK (byte-fallback) | ✅ emoji / CJK / rare unicode |
| special tokens = 1 token | ✅ 18/18 |
| reserved slots | 384 |
| JSON tool-args | 3.26 chars/tok |
Design decisions
- 128K vocab. The small-model sweet spot — a larger vocab inflates the tied embedding (already a big share of a ~1.08B model) for little fertility gain.
- Single-digit numbers. Best arithmetic / place-value alignment; number-heavy strings tokenize longer by design.
- Structural units learned by BPE, not added as special tokens. Special tokens are stripped on
decode (
{"name":"x"}→namex), which breaks losslessness; learned merges keep exact text. - Two-stage morpheme mode. Shipping the segmenter alongside is the price of genuine morpheme boundaries in a single portable tokenizer file.
- Kazakh fertility 4.4. The physical result for a balanced multilingual 128K vocab; pushing higher would require a Kazakh-dominant vocab that starves the other domains.
Files
tokenizer.json— the tokenizer (load withtokenizersortransformers).meta.json,eval_full.json— config and full metrics.morpheme_segmenter.py,qazcorpora_model.py,morpho_lemma_suf.pth— Kazakh segmenter (morpheme mode).
License: MIT. Part of the Til program; companion corpus TilQazyna/Til-Corpus.