You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Til-Tokenizer-128k

Morpheme-aware, multilingual, agentic byte-level BPE tokenizer for the Til Kazakh foundation-model program. 128 000 vocab. Kazakh-first, with Russian, English, code, math, and agent/tool-call structure.

Highlights

  • Never emits <unk> — byte-level: every byte is representable, so any unicode, emoji, CJK, or rare symbol round-trips. decode(encode(x)) == x (lossless) on every domain.
  • Single-digit numbers12345 → 1 2 3 4 5, 3.14159 → 3 . 1 4 1 5 9. Gives the model clean place-value alignment for arithmetic.
  • Kazakh morpheme awareness — optional segmentation so no token crosses a morpheme boundary. Boundary recall 0.896 / precision 0.85 / F1 0.87 vs gold (kaz-morphology-sample).
  • Agentic / structural — JSON, tool-call, markdown, diff, and code are tokenized cheaply (learned from real agent data), plus dedicated chat / FIM / reasoning / tool-call control tokens.
  • Code-friendly — indentation runs (4 sp / 8 sp / tab) and operators (== != -> => ** :: && ||) are single tokens.
  • 384 reserved slots — add future special tokens without resizing the embedding.

Design / scheme

Base: byte-level BPE (HF tokenizers). The ByteLevel alphabet covers all 256 bytes, so the tokenizer is lossless and can never UNK.

Pre-tokenizer pipeline (in order):

  1. Split(\x1F) — morpheme-boundary marker (fires only when the input was morpheme-segmented; see below).
  2. Digits(individual_digits=True) — every digit becomes its own token.
  3. ByteLevel(add_prefix_space=False) — preserves whitespace / indentation (code-friendly).

Vocabulary (131 072):

  • ~430 special / control tokens (kept whole):
    • chat / turn: <|im_start|> <|im_end|> + <|system|> <|user|> <|assistant|> <|tool|> <|tool_result|>
    • reasoning: <thinking> </thinking>
    • tool-call control: <|tool_call|> <|/tool_call|> <|arg_sep|> <|python_tag|> + <tool_call>…
    • FIM: <|fim_prefix|> <|fim_suffix|> <|fim_middle|> <|fim_pad|> <|file_sep|> <|repo_name|>
    • 384 reserved <|reserved_0..383|>
  • ~130 600 BPE merges learned from the training mix.
  • JSON / markdown / diff / path units are learned by BPE as ordinary merges (not added as special tokens) so they remain content and survive decoding — keeping the tokenizer lossless.

Kazakh morpheme mode

A neural/FST morpheme segmenter can't be serialized inside tokenizer.json, so Kazakh morpheme awareness is two-stage:

  1. Segment Kazakh text with the bundled qazcorpora BiLSTM segmenter (morpheme_segmenter.py + qazcorpora_model.py + morpho_lemma_suf.pth, trained on QazCorpora BIO tags). It inserts \x1F between morphemes.
  2. Tokenize — the Split(\x1F) step guarantees no BPE token crosses a morpheme boundary.
from morpheme_segmenter import MorphemeSegmenter
from tokenizers import Tokenizer
seg = MorphemeSegmenter(backend="qazcorpora", qazcorpora_model="morpho_lemma_suf.pth")
tok = Tokenizer.from_file("tokenizer.json")

# Kazakh — segment first (morpheme-aware):
ids = tok.encode(seg.segment("Мектептегі оқушылар математиканы оқыды")).ids
# Other languages / code — tokenize directly:
ids = tok.encode('{"name": "get_weather", "arguments": {"city": "Astana"}}').ids

Training data

Fitted on a ~39 GB balanced sample. The mix is tuned for fertility: Kazakh is agglutinative and needs more merges, while English is frequent and needs a smaller share.

domain share source
Kazakh 30% HPLT2 · mC4 · CulturaX · GlotCC · finepdfs · KazParC · MADLAD-400 v1.5 · Leipzig
English 20% FineWeb-Edu
Russian 16% FineWeb-2 (rus_Cyrl)
Code 20% github-code-clean (permissive licenses)
Math 6% OpenWebMath · FineMath
Agentic 8% glaive-function-calling-v2 · hermes-function-calling-v1 · OpenThoughts-114k · dolphin-r1 · commitpackft · cosmopedia

The agentic slice is real data (tool-call JSON, <think> reasoning, code diffs, markdown), so the tokenizer learns genuine agent/structural patterns rather than synthetic ones.

Metrics

Fertility = characters per token on real prose (higher = better):

domain chars/tok
Kazakh 4.4 (≈ 2.3 tokens/word; 8–9 on common phrases)
English 4.52
Russian 4.33
Code 3.63
Math 3.76
check result
Kazakh morpheme boundary recall 0.896
lossless roundtrip ✅ all domains
single-digit numbers
never-UNK (byte-fallback) ✅ emoji / CJK / rare unicode
special tokens = 1 token ✅ 18/18
reserved slots 384
JSON tool-args 3.26 chars/tok

Design decisions

  • 128K vocab. The small-model sweet spot — a larger vocab inflates the tied embedding (already a big share of a ~1.08B model) for little fertility gain.
  • Single-digit numbers. Best arithmetic / place-value alignment; number-heavy strings tokenize longer by design.
  • Structural units learned by BPE, not added as special tokens. Special tokens are stripped on decode ({"name":"x"}namex), which breaks losslessness; learned merges keep exact text.
  • Two-stage morpheme mode. Shipping the segmenter alongside is the price of genuine morpheme boundaries in a single portable tokenizer file.
  • Kazakh fertility 4.4. The physical result for a balanced multilingual 128K vocab; pushing higher would require a Kazakh-dominant vocab that starves the other domains.

Files

  • tokenizer.json — the tokenizer (load with tokenizers or transformers).
  • meta.json, eval_full.json — config and full metrics.
  • morpheme_segmenter.py, qazcorpora_model.py, morpho_lemma_suf.pth — Kazakh segmenter (morpheme mode).

License: MIT. Part of the Til program; companion corpus TilQazyna/Til-Corpus.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support