BrahmicTokenizer-131K
A 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. Drop-in replacement for any o200k_base training pipeline: same byte-level BPE algorithm, same GPT-2 ByteLevel pre-tokenizer, same decoder, same vocabulary file format.
The tokenizer was presented in the paper BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base.
Citation
@misc{shravan2026brahmictokenizer,
title={BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k\_base},
author={Rohan Shravan},
year={2026},
eprint={2605.29379},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.29379}
}
Headline results
On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB):
- 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same 131K vocab budget
- Per-language savings range from 15.79% (Tamil) to 76.79% (Odia, a 4.31× compression ratio)
- The saving holds on all 11 Brahmic languages
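The Odia figure above pairs a percentage saving with a compression ratio. As a small illustrative sketch (the function name is ours, not from the paper), the conversion between "X% fewer tokens" and a baseline-to-new token-count ratio is just 1 / (1 - X/100):

```python
def compression_ratio(savings_pct: float) -> float:
    """Convert 'X% fewer tokens' into a baseline/new token-count ratio."""
    if not 0 <= savings_pct < 100:
        raise ValueError("savings_pct must be in [0, 100)")
    return 1.0 / (1.0 - savings_pct / 100.0)


# Odia saving quoted in the list above
odia_ratio = compression_ratio(76.79)
print(f"Odia: {odia_ratio:.2f}x")  # -> Odia: 4.31x
```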
On non-Indic content (FLORES-200, HumanEval, MBPP, GSM8K):
- English fertility 1.235 tokens/word — matches o200k_base (1.232)
- Best-in-class code/math compression at the 131K vocab class (0.295 / 0.320 / 0.301 tokens/char on HumanEval / MBPP / GSM8K)
- Beats Tekken/Sarvam-m by 4.0–14.2% on HumanEval, MBPP, GSM8K
- EU language fertility within 3% of best (French 1.464, German 1.653, Spanish 1.388)
On FLORES-200 dev+devtest, measured by mean Brahmic fertility (rank 4 of 11 publicly downloadable tokenizers):
- Mean Brahmic fertility of 2.84 vs 4.87 for Tekken/Sarvam-m, a 41.8% relative improvement at the same vocab budget
- Beats Tekken/Sarvam-m on every Brahmic language individually (Or 77%, As 42%, Gu 37%, Ml 27%, …)
Across the 14-tokenizer benchmark, BrahmicTokenizer-131K is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K vocabulary budget.
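For readers who want to reproduce numbers of this kind, the two metrics quoted above (tokens per word and tokens per character) can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: it assumes words are whitespace-delimited and characters are Unicode code points, and the paper may segment differently. The `toy_encode` function is a hypothetical stand-in so the block runs without downloading anything; in practice you would pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`.

```python
from typing import Callable, Iterable, List

Encoder = Callable[[str], List[int]]


def fertility(encode: Encoder, texts: Iterable[str]) -> float:
    """Tokens per whitespace-delimited word over a corpus (assumed definition)."""
    tokens = words = 0
    for text in texts:
        tokens += len(encode(text))
        words += len(text.split())
    if words == 0:
        raise ValueError("corpus contains no words")
    return tokens / words


def tokens_per_char(encode: Encoder, texts: Iterable[str]) -> float:
    """Tokens per Unicode code point over a corpus (assumed definition)."""
    tokens = chars = 0
    for text in texts:
        tokens += len(encode(text))
        chars += len(text)
    if chars == 0:
        raise ValueError("corpus contains no characters")
    return tokens / chars


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


print(fertility(toy_encode, ["ab cd"]))       # 5 bytes / 2 words
print(tokens_per_char(toy_encode, ["abcd"]))  # 4 bytes / 4 chars
```

Lower is better for both metrics; a byte-per-token encoder like the stand-in is the worst case for any script whose characters need several UTF-8 bytes.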
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("theschoolofai/BrahmicTokenizer-131K")
# Hindi
print(tokenizer.encode("भारत एक देश है", add_special_tokens=False))
# -> [66526, 2420, 13092, 732]
# Digit grouping (inherited bit-identically from o200k_base)
print(tokenizer.encode("1234567890", add_special_tokens=False))
# -> [4660, 14932, 23133, 26]
# Decoded: ['123', '456', '789', '0']
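A quick sanity check when adopting any byte-level BPE tokenizer is a lossless round trip, plus a per-token view like the "Decoded" comment above. The sketch below assumes only the standard `encode` / `decode` methods of a Hugging Face tokenizer; `ByteStub` is a hypothetical stand-in (not the real vocabulary) so the block runs offline. Swap in the `tokenizer` loaded above to check the real artifact.

```python
from typing import List


class ByteStub:
    """Hypothetical stand-in: one token per UTF-8 byte, same method names as HF."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: List[int], clean_up_tokenization_spaces: bool = False) -> str:
        return bytes(ids).decode("utf-8", errors="replace")


def roundtrip_ok(tokenizer, text: str) -> bool:
    """True if encode followed by decode reproduces the input exactly."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids, clean_up_tokenization_spaces=False) == text


def pieces(tokenizer, text: str) -> List[str]:
    """Decode each token ID on its own to see how the text was split."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in ids]


stub = ByteStub()
print(roundtrip_ok(stub, "भारत एक देश है"))
print(pieces(stub, "1234"))
```

Note that decoding a single token can yield a replacement character when that token holds only part of a multi-byte UTF-8 sequence, so `pieces` is a debugging aid rather than a faithful segmentation.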
Vocabulary: 131,072 tokens. Specials: 356 added tokens including the standard EOS (<|end_of_text|>, ID 36), BOS (<|begin_of_text|>, ID 130725), PAD (<|pad|>, ID 130726), UNK (<|unk|>, ID 130727), FIM, multimodal, and reserved-slot markers.
Construction
Two-stage surgical retrofit of o200k_base:
- Stage 1 (script-prune crop): removed 38,345 tokens covering 9 non-target scripts (CJK Unified Ideographs, Hangul, Hiragana+Katakana, Arabic, Cyrillic, Thai, Greek, Hebrew, Sinhala), reducing 200,019 → 131,072 slots and forming o200k_cropped.
- Stage 2 (surgical Brahmic retrofit): replaced 2,372 corpus-dead vocabulary slots in o200k_cropped with high-frequency Brahmic content, allocated across the 9 Brahmic scripts by linear-programming optimization on a 1.045-billion-token audit corpus.
The pre-tokenizer, decoder, and English/EU/code merge rules are inherited unchanged from o200k_base. The vocabulary content differs in 40,717 of 131,072 slots, but the tokenizer-side interface (algorithm, pre-tokenizer regex, decoder, special-token format, JSON schema) is identical.
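To make the Stage 1 idea concrete, here is a rough sketch of a script-based vocabulary filter. It is not the authors' procedure: it approximates script membership from the first word of each character's Unicode name, it assumes tokens are already decoded to ordinary strings (real byte-level BPE vocabularies store a byte-to-character mapping and contain partial UTF-8 sequences), and renumbering the surviving IDs contiguously is our assumption. All names are illustrative.

```python
import unicodedata
from typing import Dict

# Unicode-name prefixes standing in for the pruned scripts (approximation).
PRUNED_PREFIXES = (
    "CJK", "HANGUL", "HIRAGANA", "KATAKANA", "ARABIC",
    "CYRILLIC", "THAI", "GREEK", "HEBREW", "SINHALA",
)


def uses_pruned_script(token: str) -> bool:
    """True if any character's Unicode name starts with a pruned-script prefix."""
    return any(
        unicodedata.name(ch, "").startswith(PRUNED_PREFIXES) for ch in token
    )


def crop(vocab: Dict[str, int]) -> Dict[str, int]:
    """Drop pruned-script tokens and renumber survivors in original ID order."""
    ordered = sorted(vocab.items(), key=lambda kv: kv[1])
    kept = [tok for tok, _ in ordered if not uses_pruned_script(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}


toy_vocab = {"hello": 0, "привет": 1, "भारत": 2, "中": 3}
print(crop(toy_vocab))
```

A full crop would also have to drop every merge rule that produces or consumes a removed token, which this sketch does not attempt.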
Structural properties
- Every normal token is at most 32 UTF-8 bytes (the longest is a 32-space filler)
- Zero tokens spanning two disjoint writing systems
- Zero cross-script merge rules in the 301,398-entry merge list
Together, the 32-byte cap and the absence of cross-script tokens and merges make BrahmicTokenizer-131K and o200k_cropped the only two of the 14 publicly available tokenizers we benchmarked that satisfy both constraints. This matters for byte-pooled embedding architectures with a fixed per-token byte budget.
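A check for the byte cap and the single-script property might look like the sketch below. It is an approximation under stated assumptions, not the repository's audit: script is inferred from the first word of the Unicode name of each letter (category `L*`), so combining marks, digits, and punctuation are ignored and a few odd letters may be misclassified. A stricter check would use the Unicode Script property. Function and constant names are ours.

```python
import unicodedata
from typing import Iterable, List, Set

MAX_TOKEN_BYTES = 32  # byte cap stated in the list above


def scripts_in(token: str) -> Set[str]:
    """Approximate set of scripts used by the letters of a token."""
    scripts = set()
    for ch in token:
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def violations(tokens: Iterable[str]) -> List[str]:
    """Tokens that exceed the byte cap or mix letters from two scripts."""
    bad = []
    for tok in tokens:
        too_long = len(tok.encode("utf-8")) > MAX_TOKEN_BYTES
        mixed = len(scripts_in(tok)) > 1
        if too_long or mixed:
            bad.append(tok)
    return bad


sample = [" " * 32, "hello", "भारत", "aभ", " " * 33]
print(violations(sample))  # flags the mixed-script token and the 33-space token
```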
Files
- tokenizer.json: the BPE artifact (8.0 MB, vocab 131,072, merges 301,398, added tokens 356)
- tokenizer_config.json: HuggingFace AutoTokenizer configuration
- special_tokens_map.json: BOS/EOS/PAD/UNK declarations
- LICENSE: Apache 2.0
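The counts quoted for tokenizer.json can be checked by reading the file directly. The sketch below assumes the usual Hugging Face `tokenizers` JSON layout for a BPE model (a top-level `added_tokens` list and a `model` object holding `vocab` and `merges`). Whether the 131,072 figure includes the added tokens is not stated here, so compare both readings. The toy dictionary is hypothetical and stands in for the real file.

```python
import json
from typing import Dict


def summarize(tokenizer_json: dict) -> Dict[str, int]:
    """Count vocab entries, merge rules, and added tokens in a tokenizer.json dict."""
    model = tokenizer_json["model"]
    return {
        "vocab": len(model["vocab"]),
        "merges": len(model["merges"]),
        "added_tokens": len(tokenizer_json.get("added_tokens", [])),
    }


# Hypothetical miniature of the file layout, so the block runs offline.
toy = json.loads("""
{
  "added_tokens": [{"id": 3, "content": "<|pad|>"}],
  "model": {
    "type": "BPE",
    "vocab": {"a": 0, "b": 1, "ab": 2},
    "merges": ["a b"]
  }
}
""")
print(summarize(toy))  # -> {'vocab': 3, 'merges': 1, 'added_tokens': 1}
```

For the real artifact, replace the toy input with `json.load(open("tokenizer.json", encoding="utf-8"))` after downloading the file.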
The reproduction scripts (verification, fertility evaluation, 27M-corpus tokenization, 23-test audit) live in the GitHub repository: https://github.com/theschoolofai/BrahmicTokenizer-131K.
License
Apache License 2.0. This work is a derivative of OpenAI's o200k_base tokenizer, released through the MIT-licensed tiktoken repository; Apache 2.0 is compatible with incorporating MIT-licensed material. The bundled Brahmic-script fonts referenced in the paper (NotoSansDevanagari, NotoSansBengali, NotoSansOriya, NotoSansTamil) are redistributed under the SIL Open Font License 1.1.