Qwen 3.5 — Nepali Extended Tokenizer

Extended tokenizer for Qwen 3.5 with ~15K added high-value Nepali/Devanagari tokens.

Token Efficiency

Tokenizer            Nepali tokens/word
Original Qwen 3.5    4.49
Extended (this)      2.50
Reduction            44.3%
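
The metric above can be reproduced roughly as follows. This is a minimal sketch, not the benchmark's actual script: it assumes words are counted by whitespace splitting, and the helper names `tokens_per_word` and `relative_reduction` are hypothetical. Any callable that maps a string to a list of tokens (for example `tokenizer.tokenize`) can be passed in.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word (assumed word definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Check the reduction reported in the table from its two tok/word values.
print(f"{relative_reduction(4.49, 2.50):.1%}")  # prints 44.3%
```

Calling `tokens_per_word(tokenizer.tokenize, sentences)` once with the original tokenizer and once with the extended one would give the two values to compare.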

How It Was Built

  1. Trained a 32K SentencePiece BPE tokenizer on a 7.49 GB cleaned Nepali corpus
  2. Selected tokens that the Qwen 3.5 base tokenizer splits into 3+ subtokens (delta vocabulary approach)
  3. Added ~15K high-value Nepali tokens to the base tokenizer
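
The selection in step 2 can be sketched as below. This is an illustration under stated assumptions, not the author's code: `select_delta_tokens` is a hypothetical helper, `candidates` stands for the vocabulary of the 32K SentencePiece model, and `base_tokenize` is any callable that returns the base tokenizer's subtokens for a string. How "high-value" tokens were ranked to reach ~15K is not stated in this card, so no ranking is attempted here.

```python
from typing import Callable, Iterable, List


def select_delta_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    min_pieces: int = 3,
) -> List[str]:
    """Keep candidates that the base tokenizer splits into `min_pieces` or more subtokens."""
    selected: List[str] = []
    seen = set()
    for piece in candidates:
        # SentencePiece marks word boundaries with U+2581; drop it so the
        # base tokenizer sees plain text.
        text = piece.replace("\u2581", " ").strip()
        if not text or text in seen:
            continue
        if len(base_tokenize(text)) >= min_pieces:
            seen.add(text)
            selected.append(text)
    return selected
```

With Hugging Face tokenizers, one way to apply the result would be `base_tokenizer.add_tokens(selected)`. Whether the extended tokenizer here was built that way or by editing the BPE vocabulary and merges directly is not specified.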

The extended tokenizer is a drop-in replacement for the original. To use the new tokens effectively, the model needs continued pretraining on Nepali text (see the Qwen3-4B Nepali model for a full CPT+SFT example).
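
Before continued pretraining, the model's embedding matrix has to cover every token ID the extended tokenizer can produce. The sketch below is an assumption about a typical setup rather than a documented step of this project: `ensure_embedding_capacity` is a hypothetical helper, and it relies only on the standard `transformers` model methods `get_input_embeddings` and `resize_token_embeddings`.

```python
def ensure_embedding_capacity(model, tokenizer) -> bool:
    """Grow the model's embeddings if the tokenizer has more tokens than embedding rows.

    Returns True if a resize was performed, False if the model was already large enough.
    """
    vocab_size = len(tokenizer)  # includes added tokens for Hugging Face tokenizers
    embedding_rows = model.get_input_embeddings().weight.shape[0]
    if vocab_size > embedding_rows:
        # New rows are freshly initialized and only become useful after training.
        model.resize_token_embeddings(vocab_size)
        return True
    return False
```

The newly added rows carry no learned meaning, which is why continued pretraining on Nepali text is needed for the new tokens to help.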

Usage

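from transformers import AutoTokenizer

# Load the extended tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sidskarki/qwen3.5-nepali-tokenizer")
# Tokenize a Nepali sentence ("The capital of Nepal is Kathmandu")
tokens = tokenizer.tokenize("नेपालको राजधानी काठमाडौं हो")
print(tokens, len(tokens))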

Context

Part of a 17-model Nepali tokenizer benchmark measuring the Nepali token tax across modern LLM tokenizers.
