BrahmicTokenizer-131K

A 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. Drop-in replacement for any o200k_base training pipeline: same byte-level BPE algorithm, same GPT-2 ByteLevel pre-tokenizer, same decoder, same vocabulary file format.

The model was presented in the paper BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base.

Citation

@misc{shravan2026brahmictokenizer,
  title={BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k\_base},
  author={Rohan Shravan},
  year={2026},
  eprint={2605.29379},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.29379}
}

Headline results

On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB):

26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same 131K vocab budget
Per-language savings range from 15.79% (Tamil) to 76.79% (Odia, a 4.31× compression ratio)
Savings hold on all 11 Brahmic languages, with no exceptions
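
The relationship between a percentage saving and the compression ratio quoted for Odia can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `compression_ratio` is hypothetical and not part of the released scripts.

```python
def compression_ratio(savings_pct: float) -> float:
    """Convert a percentage token saving into a baseline/new token-count ratio."""
    if not 0.0 <= savings_pct < 100.0:
        raise ValueError("savings_pct must be in [0, 100)")
    # If the new tokenizer emits (1 - s) of the baseline tokens,
    # the baseline emits 1 / (1 - s) times as many.
    return 1.0 / (1.0 - savings_pct / 100.0)


print(round(compression_ratio(76.79), 2))  # Odia  -> 4.31
print(round(compression_ratio(15.79), 2))  # Tamil -> 1.19
```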

On non-Indic content (FLORES-200, HumanEval, MBPP, GSM8K):

English fertility 1.235 tokens/word, within 0.3% of o200k_base (1.232)
Best-in-class code/math compression at the 131K vocab class (0.295 / 0.320 / 0.301 tokens/char on HumanEval / MBPP / GSM8K)
Beats Tekken/Sarvam-m by 4.0–14.2% on HumanEval, MBPP, GSM8K
EU language fertility within 3% of best (French 1.464, German 1.653, Spanish 1.388)

On FLORES-200 dev+devtest Brahmic fertility, where BrahmicTokenizer-131K ranks 4th of 11 publicly downloadable tokenizers:

Mean Brahmic fertility of 2.84 for BrahmicTokenizer-131K vs 4.87 for Tekken/Sarvam-m (lower is better), a 41.8% relative improvement at the same vocab budget
Beats Tekken/Sarvam-m on every Brahmic language individually (Odia 77%, Assamese 42%, Gujarati 37%, Malayalam 27%, …)
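
The relative improvement is the fertility reduction expressed as a fraction of the baseline. A minimal sketch follows, with the hypothetical helper `relative_improvement`. Using the two-decimal means quoted above it gives about 41.7%; the reported 41.8% presumably comes from unrounded means.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage reduction of `candidate` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline * 100.0


# Rounded mean Brahmic fertilities from the list above.
print(f"{relative_improvement(4.87, 2.84):.1f}%")  # 41.7%
```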

Across the 14-tokenizer benchmark, BrahmicTokenizer-131K is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K vocabulary budget.

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("theschoolofai/BrahmicTokenizer-131K")

# Hindi
print(tokenizer.encode("भारत एक देश है", add_special_tokens=False))
# -> [66526, 2420, 13092, 732]

# Digit grouping (inherited bit-identically from o200k_base)
print(tokenizer.encode("1234567890", add_special_tokens=False))
# -> [4660, 14932, 23133, 26]
# Decoded: ['123', '456', '789', '0']
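
To see how a string is split and to confirm that decoding restores it, a small helper such as the one below may be useful. It is a sketch: `inspect_encoding` is a hypothetical name, and it relies only on the `encode` and `decode` methods of a Hugging Face tokenizer. Decoding a single id can yield a replacement character when a token ends in the middle of a multi-byte UTF-8 character, which is expected for byte-level BPE.

```python
from typing import List, Tuple


def inspect_encoding(tokenizer, text: str) -> Tuple[List[int], List[str], bool]:
    """Return (token ids, per-token strings, whether decode(encode(text)) == text)."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    # Decode each id on its own to see the individual pieces.
    pieces = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in ids]
    # Decode the full sequence to check the round trip.
    restored = tokenizer.decode(ids, clean_up_tokenization_spaces=False)
    return ids, pieces, restored == text
```

With the tokenizer loaded as above, `inspect_encoding(tokenizer, "1234567890")` should return the ids and three-digit pieces listed in the digit-grouping example, with the round-trip flag set to `True`.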

Construction

Two-stage surgical retrofit of o200k_base:

Stage 1 — script-prune crop: removed 38,345 tokens covering 9 non-target scripts (CJK Unified Ideographs, Hangul, Hiragana+Katakana, Arabic, Cyrillic, Thai, Greek, Hebrew, Sinhala), reducing 200,019 → 131,072 slots and forming o200k_cropped.
Stage 2 — surgical Brahmic retrofit: replaced 2,372 corpus-dead vocabulary slots in o200k_cropped with high-frequency Brahmic content, allocated across the 9 Brahmic scripts by linear-programming optimization on a 1.045-billion-token audit corpus.
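
As an illustration of the first step of Stage 2, the sketch below finds vocabulary slots that never occur when a corpus is tokenized. It assumes "corpus-dead" means zero occurrences in the audit corpus, which is an interpretation rather than the paper's stated definition, and it does not attempt the linear-programming allocation. `find_dead_slots` and `toy_encode` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Iterable, List


def find_dead_slots(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    vocab_size: int,
) -> List[int]:
    """Return token ids in [0, vocab_size) that never appear in the tokenized corpus."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(encode(text))
    # Counter returns 0 for ids it has never seen.
    return [token_id for token_id in range(vocab_size) if counts[token_id] == 0]


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder (one token per UTF-8 byte) so the sketch runs offline."""
    return list(text.encode("utf-8"))


dead = find_dead_slots(["ab"], toy_encode, vocab_size=256)
print(len(dead))  # 254: every byte value except 'a' (97) and 'b' (98)
```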

The pre-tokenizer, decoder, and English/EU/code merge rules are inherited unchanged from o200k_base. The vocabulary content differs in 40,717 of 131,072 slots, but the tokenizer-side interface (algorithm, pre-tokenizer regex, decoder, special-token format, JSON schema) is identical.

Structural properties

Every normal token is at most 32 UTF-8 bytes (the longest is a 32-space filler)
Zero tokens spanning two disjoint writing systems
Zero cross-script merge rules in the 301,398-entry merge list

Taken together, the 32-byte cap and script purity (no cross-script tokens or merges) are two constraints that only BrahmicTokenizer-131K and o200k_cropped, of the 14 publicly available tokenizers we benchmarked, satisfy simultaneously. This matters for byte-pooled embedding architectures with a fixed per-token byte budget.
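
A rough version of such a check can be written with the standard library alone. The sketch below is not the audit shipped in the repository. It approximates a character's script by the first word of its Unicode name, which is cruder than the Unicode Script property (for example, it treats Hiragana and Katakana as different scripts). It expects decoded token strings, not the byte-level alphabet stored in tokenizer.json. The names `scripts_in` and `passes_audit` are hypothetical.

```python
import unicodedata
from typing import Set


def scripts_in(token: str) -> Set[str]:
    """Approximate the scripts of the letters in `token` via Unicode name prefixes."""
    found: Set[str] = set()
    for ch in token:
        # Only letters are considered; digits, spaces, punctuation, and
        # combining marks are treated as script-neutral here.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def passes_audit(token: str, max_bytes: int = 32) -> bool:
    """True if the token fits the byte cap and its letters come from one script."""
    within_cap = len(token.encode("utf-8")) <= max_bytes
    single_script = len(scripts_in(token)) <= 1
    return within_cap and single_script


print(passes_audit("भारत"))    # True: one script, 12 bytes
print(passes_audit(" " * 32))  # True: exactly at the byte cap
print(passes_audit("aक"))      # False: Latin and Devanagari letters mixed
```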

Files

tokenizer.json — the BPE artifact (8.0 MB, vocab 131,072, merges 301,398, added tokens 356)
tokenizer_config.json — HuggingFace AutoTokenizer configuration
special_tokens_map.json — BOS/EOS/PAD/UNK declarations
LICENSE — Apache 2.0

The reproduction scripts (verification, fertility evaluation, 27M-corpus tokenization, 23-test audit) live in the GitHub repository: https://github.com/theschoolofai/BrahmicTokenizer-131K.

License

Apache License 2.0. This work is a derivative of OpenAI's o200k_base tokenizer, released through the MIT-licensed tiktoken repository; Apache 2.0 is compatible with incorporating MIT-licensed material. The bundled Brahmic-script fonts referenced in the paper (NotoSansDevanagari, NotoSansBengali, NotoSansOriya, NotoSansTamil) are redistributed under the SIL Open Font License 1.1.
