Qwen 3.5 — Nepali Extended Tokenizer

Extended tokenizer for Qwen 3.5 with ~15K added high-value Nepali/Devanagari tokens.

Token Efficiency

Tokenizer	Nepali tokens/word
Original Qwen 3.5	4.49
Extended (this)	2.50
Reduction	44.3%
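
The reduction figure follows directly from the two tokens-per-word values. The sketch below shows one way such a metric could be computed and checks the arithmetic. It assumes a "word" is a whitespace-delimited chunk, which the benchmark may or may not use; `tokens_per_word` and `relative_reduction` are illustrative helper names, not part of this repository.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average tokens per word, counting words by whitespace (an assumption)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Values taken from the table above.
print(f"{relative_reduction(4.49, 2.50):.1%}")  # prints 44.3%
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`, for example `tokenizer.tokenize` from a loaded Hugging Face tokenizer.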

How It Was Built

1. Trained a 32K-vocabulary SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus.
2. Selected the tokens from that vocabulary that the Qwen 3.5 base tokenizer splits into 3 or more subtokens (the delta vocabulary approach).
3. Added the resulting ~15K high-value Nepali tokens to the base tokenizer.
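
The selection step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script used to build this tokenizer: `select_delta_tokens` is a hypothetical helper, the candidates are assumed to arrive in priority order, and the toy base tokenizer stands in for the real Qwen 3.5 tokenizer. Real SentencePiece pieces also carry a `▁` word-boundary marker that would need handling before comparison.

```python
from typing import Callable, Iterable, List, Set


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    base_vocab: Set[str],
    min_pieces: int = 3,
    limit: int = 15000,
) -> List[str]:
    """Keep candidates that the base tokenizer splits into `min_pieces` or more subtokens."""
    selected: List[str] = []
    seen: Set[str] = set()
    for tok in candidates:
        # Skip tokens the base tokenizer already has, and duplicates.
        if tok in base_vocab or tok in seen:
            continue
        if len(base_tokenize(tok)) >= min_pieces:
            selected.append(tok)
            seen.add(tok)
            if len(selected) >= limit:
                break
    return selected


# Toy stand-in for a base tokenizer: one token if known, else one per character.
toy_vocab = {"ab"}


def toy_base_tokenize(text: str) -> List[str]:
    return [text] if text in toy_vocab else list(text)


picked = select_delta_tokens(["ab", "xyz", "pq", "hello", "xyz"], toy_base_tokenize, toy_vocab)
print(picked)  # ['xyz', 'hello']
```

With a Hugging Face tokenizer, one plausible way to add the selected strings is `tokenizer.add_tokens(picked)`; whether this repository used that call is not stated here.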

The extended tokenizer is a drop-in replacement for the original tokenizer. The base model has never seen the added tokens, so using them effectively requires continued pretraining on Nepali text (see the Qwen3-4B Nepali model for a full CPT+SFT example).

Usage

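from transformers import AutoTokenizer

# Load the extended tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sidskarki/qwen3.5-nepali-tokenizer")

# Tokenize a Nepali sentence ("The capital of Nepal is Kathmandu").
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")

# Show the tokens and how many were produced.
print(tokens, len(tokens))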

Context

This tokenizer is part of a 17-model Nepali tokenizer benchmark that measures the Nepali token tax across modern LLM tokenizers.
