BrahmicTokenizer-131K

A 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. Drop-in replacement for any o200k_base training pipeline: same byte-level BPE algorithm, same GPT-2 ByteLevel pre-tokenizer, same decoder, same vocabulary file format.

The tokenizer was presented in the paper BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base.

Citation

@misc{shravan2026brahmictokenizer,
  title={BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k\_base},
  author={Rohan Shravan},
  year={2026},
  eprint={2605.29379},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.29379}
}

Headline results

On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB):

  • 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same 131K vocab budget
  • Per-language savings 15.79% (Tamil) to 76.79% (Odia, a 4.31× compression ratio)
  • Savings hold on all 11 Brahmic languages
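
The relationship between a percentage saving and a compression ratio in the list above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming "savings" means (baseline tokens − new tokens) / baseline tokens; the helper name `compression_ratio` is hypothetical and not part of the released scripts.

```python
def compression_ratio(savings_fraction: float) -> float:
    """Baseline-to-new token ratio implied by a fractional token saving.

    Assumes savings = (baseline - new) / baseline, so ratio = 1 / (1 - savings).
    """
    if not 0.0 <= savings_fraction < 1.0:
        raise ValueError("savings_fraction must be in [0, 1)")
    return 1.0 / (1.0 - savings_fraction)


# Figures taken from the per-language range quoted above.
odia = compression_ratio(0.7679)
tamil = compression_ratio(0.1579)
print(f"Odia: {odia:.2f}x, Tamil: {tamil:.2f}x")  # Odia: 4.31x, Tamil: 1.19x
```

Under that assumption, the 76.79% Odia saving reproduces the quoted 4.31× ratio.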

On non-Indic content (FLORES-200, HumanEval, MBPP, GSM8K):

  • English fertility 1.235 tokens/word — matches o200k_base (1.232)
  • Best-in-class code/math compression at the 131K vocab class (0.295 / 0.320 / 0.301 tokens/char on HumanEval / MBPP / GSM8K)
  • Beats Tekken/Sarvam-m by 4.0–14.2% on HumanEval, MBPP, GSM8K
  • EU language fertility within 3% of best (French 1.464, German 1.653, Spanish 1.388)
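
For readers who want to reproduce figures of this kind, the two metrics used above (tokens/word and tokens/char) can be sketched as follows. This is an illustration rather than the paper's evaluation code: it assumes words are whitespace-delimited, which may differ from the segmentation used in the benchmark, and `fertility`, `tokens_per_char`, and `toy_encode` are hypothetical names.

```python
from typing import Callable, Iterable, List


def fertility(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Mean tokens per whitespace-delimited word (assumed word definition)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("corpus contains no words")
    return n_tokens / n_words


def tokens_per_char(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Mean tokens per Unicode code point, the metric quoted for code and math."""
    n_tokens = 0
    n_chars = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_chars += len(text)
    if n_chars == 0:
        raise ValueError("corpus is empty")
    return n_tokens / n_chars


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder for the demo: one token per non-space character."""
    return [ord(c) for c in text if not c.isspace()]


corpus = ["ab cd", "efg"]
print(fertility(toy_encode, corpus))        # 7 tokens / 3 words
print(tokens_per_char(toy_encode, corpus))  # 7 tokens / 8 characters
```

With the real tokenizer, the `encode` argument would be something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`.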

On FLORES-200 dev+devtest, Brahmic fertility ranks 4th of 11 publicly downloadable tokenizers:

  • Mean Brahmic fertility of 2.84 vs 4.87 for Tekken/Sarvam-m, a 41.8% relative improvement at the same vocab budget
  • Beats Tekken/Sarvam-m on every Brahmic language individually (Or 77%, As 42%, Gu 37%, Ml 27%, …)

Across the 14-tokenizer benchmark, BrahmicTokenizer-131K is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K vocabulary budget.

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("theschoolofai/BrahmicTokenizer-131K")

# Hindi
print(tokenizer.encode("भारत एक देश है", add_special_tokens=False))
# -> [66526, 2420, 13092, 732]

# Digit grouping (inherited bit-identically from o200k_base)
print(tokenizer.encode("1234567890", add_special_tokens=False))
# -> [4660, 14932, 23133, 26]
# Decoded: ['123', '456', '789', '0']

Vocabulary: 131,072 tokens. Specials: 356 added tokens including the standard EOS (<|end_of_text|>, ID 36), BOS (<|begin_of_text|>, ID 130725), PAD (<|pad|>, ID 130726), UNK (<|unk|>, ID 130727), FIM, multimodal, and reserved-slot markers.
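
The digit grouping in the usage example comes from the pre-tokenizer, which splits runs of digits into chunks of at most three, left to right, before any BPE merges apply. The sketch below imitates only that digit branch. It is a simplification: the o200k_base pattern uses `\p{N}{1,3}`, which Python's built-in `re` does not support, so `\d{1,3}` is used as an approximation, and `group_digits` is a hypothetical helper.

```python
import re

# Approximation of the digit branch of the pre-tokenizer regex.
# \d covers decimal digits only, whereas \p{N} also covers other numeric characters.
DIGIT_RUN = re.compile(r"\d{1,3}")


def group_digits(s: str) -> list:
    """Split a digit string into left-to-right chunks of up to three digits."""
    return DIGIT_RUN.findall(s)


print(group_digits("1234567890"))  # ['123', '456', '789', '0']
```

The chunks match the decoded pieces shown in the usage example, where each chunk maps to a single token.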

Construction

Two-stage surgical retrofit of o200k_base:

  1. Stage 1 — script-prune crop: removed 38,345 tokens covering 9 non-target scripts (CJK Unified Ideographs, Hangul, Hiragana+Katakana, Arabic, Cyrillic, Thai, Greek, Hebrew, Sinhala), reducing 200,019 → 131,072 slots and forming o200k_cropped.
  2. Stage 2 — surgical Brahmic retrofit: replaced 2,372 corpus-dead vocabulary slots in o200k_cropped with high-frequency Brahmic content, allocated across the 9 Brahmic scripts by linear-programming optimization on a 1.045-billion-token audit corpus.
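
The idea behind Stage 1 can be illustrated on a toy vocabulary. This is a rough sketch, not the author's procedure: it assumes tokens are already decoded to text and classifies characters by Unicode name prefix, which is cruder than a proper script property lookup. `PRUNED_PREFIXES`, `in_pruned_script`, and `script_prune` are hypothetical names.

```python
import unicodedata
from typing import List

# Unicode-name prefixes standing in for the nine pruned scripts (illustrative only).
PRUNED_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH", "HANGUL", "HIRAGANA", "KATAKANA", "ARABIC",
    "CYRILLIC", "THAI", "GREEK", "HEBREW", "SINHALA",
)


def in_pruned_script(ch: str) -> bool:
    """True if the character's Unicode name starts with a pruned-script prefix."""
    return unicodedata.name(ch, "").startswith(PRUNED_PREFIXES)


def script_prune(vocab: List[str]) -> List[str]:
    """Keep only tokens that contain no character from a pruned script."""
    return [tok for tok in vocab if not any(in_pruned_script(c) for c in tok)]


toy_vocab = ["hello", " the", "भारत", "中国", "привет", "한국", "ไทย", "def"]
kept = script_prune(toy_vocab)
print(kept)  # ['hello', ' the', 'भारत', 'def']
```

A real crop would also need to drop merge rules that produce removed tokens and renumber the surviving IDs, which this sketch does not attempt.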

The pre-tokenizer, decoder, and English/EU/code merge rules are inherited unchanged from o200k_base. The vocabulary content differs in 40,717 of 131,072 slots, but the tokenizer-side interface (algorithm, pre-tokenizer regex, decoder, special-token format, JSON schema) is identical.

Structural properties

  • Every normal token is at most 32 UTF-8 bytes long (the longest is a 32-space filler)
  • Zero tokens spanning two disjoint writing systems
  • Zero cross-script merge rules in the 301,398-entry merge list

Together, the 32-byte cap and the absence of cross-script tokens and merges make BrahmicTokenizer-131K and o200k_cropped the only two of the 14 publicly available tokenizers we benchmarked that satisfy both constraints at once. This matters for byte-pooled embedding architectures with a fixed per-token byte budget.
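
A check for these properties might look like the sketch below. It is an approximation under stated assumptions, not the repository's audit script: tokens are assumed to be already decoded to text (real byte-level BPE tokens can be partial UTF-8 sequences), and the script of a character is guessed from the first word of its Unicode name. `scripts_in`, `violations`, and `cross_script_merges` are hypothetical names.

```python
import unicodedata
from typing import List, Set, Tuple


def scripts_in(token: str) -> Set[str]:
    """Rough script labels for the alphabetic characters in a token."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. 'LATIN', 'DEVANAGARI'
    return scripts


def violations(vocab: List[str], max_bytes: int = 32) -> Tuple[List[str], List[str]]:
    """Return (tokens longer than max_bytes in UTF-8, tokens mixing scripts)."""
    too_long = [t for t in vocab if len(t.encode("utf-8")) > max_bytes]
    mixed = [t for t in vocab if len(scripts_in(t)) > 1]
    return too_long, mixed


def cross_script_merges(merges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return merge pairs whose two sides together involve more than one script."""
    return [(a, b) for a, b in merges if len(scripts_in(a) | scripts_in(b)) > 1]


toy_vocab = ["hello", "भारत", " " * 32, "helloभारत", "x" * 33]
too_long, mixed = violations(toy_vocab)
print(len(too_long), mixed)  # 1 ['helloभारत']
print(cross_script_merges([("he", "llo"), ("a", "भ")]))  # [('a', 'भ')]
```

On a vocabulary that satisfies the properties listed above, all three result lists would be empty.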

Files

  • tokenizer.json — the BPE artifact (8.0 MB, vocab 131,072, merges 301,398, added tokens 356)
  • tokenizer_config.json — HuggingFace AutoTokenizer configuration
  • special_tokens_map.json — BOS/EOS/PAD/UNK declarations
  • LICENSE — Apache 2.0

The reproduction scripts (verification, fertility evaluation, 27M-corpus tokenization, 23-test audit) live in the GitHub repository: https://github.com/theschoolofai/BrahmicTokenizer-131K.

License

Apache License 2.0. This work is a derivative of OpenAI's o200k_base tokenizer, released through the MIT-licensed tiktoken repository; Apache 2.0 is compatible with incorporating MIT-licensed material. The bundled Brahmic-script fonts referenced in the paper (NotoSansDevanagari, NotoSansBengali, NotoSansOriya, NotoSansTamil) are redistributed under the SIL Open Font License 1.1.
