Qwen3-4B Nepali — Extended Tokenizer + CPT + SFT

Qwen3-4B with 15K added Nepali tokens, continued pretraining on Nepali text, and instruction fine-tuning on 113K Nepali instruction pairs.

The extended tokenizer cuts Nepali token count by 48% (6.13 → 3.18 tokens/word on a 2,000-doc Nepali CC-100 benchmark split). The model was then trained via LoRA CPT and SFT so it can actually use those new tokens.
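
The 48% figure follows from the two tokens-per-word measurements. A quick check, using only the numbers quoted above (the function name is illustrative, not part of the benchmark code):

```python
def token_reduction(before_tpw: float, after_tpw: float) -> float:
    """Fractional drop in tokens per word after extending the tokenizer."""
    return (before_tpw - after_tpw) / before_tpw


reduction = token_reduction(6.13, 3.18)
print(f"{reduction:.1%}")  # 48.1%
```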

How to Load

Loading requires both the SFT adapter and the CPT checkpoint. The CPT checkpoint contains trained embeddings for the new Nepali tokens that must be manually restored after loading the adapter.

import os
import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download this repo
repo_path = snapshot_download("sidskarki/qwen3-4b-nepali")

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(
    os.path.join(repo_path, "sft-adapter"), trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B", torch_dtype=torch.bfloat16,
    trust_remote_code=True, device_map="auto",
)

# Resize embeddings and load SFT LoRA adapter
base_vocab_size = model.get_input_embeddings().weight.shape[0]  # 151936
model.resize_token_embeddings(len(tokenizer))  # 166925
model = PeftModel.from_pretrained(
    model, os.path.join(repo_path, "sft-adapter"),
    torch_dtype=torch.bfloat16, is_trainable=False,
)

# Restore trained new-token embeddings from CPT checkpoint
cpt_sd = load_file(os.path.join(repo_path, "cpt-checkpoint", "adapter_model.safetensors"))
bm = model.get_base_model()
with torch.no_grad():
    bm.get_input_embeddings().weight[base_vocab_size:len(tokenizer)].copy_(
        cpt_sd["base_model.model.model.embed_tokens.new_weight"].to(
            device=bm.get_input_embeddings().weight.device, dtype=torch.bfloat16))
    bm.get_output_embeddings().weight[base_vocab_size:len(tokenizer)].copy_(
        cpt_sd["base_model.model.lm_head.trainable_embedding.new_weight"].to(
            device=bm.get_output_embeddings().weight.device, dtype=torch.bfloat16))

model.eval()

Why the Manual Embedding Restoration?

During CPT, we used a custom TrainableTokenEmbedding wrapper that freezes the original 151K base embedding rows and trains only the 15K new Nepali rows. Training the full 166K-row embedding matrices through PEFT's modules_to_save would have added 1.4B trainable params; the wrapper cuts that to 38M and reduces step time from ~70s to ~20s on an A40.

The trade-off is that PeftModel.from_pretrained doesn't know about this custom wrapper, so the trained new_weight tensors must be extracted from the CPT checkpoint's safetensors file and copied into the resized embedding matrix manually.

See cpt_train.py in the code repository for the full TrainableTokenEmbedding implementation.
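
For readers who want the idea without opening the repository, here is a minimal sketch of a partially trainable embedding. It is not the author's TrainableTokenEmbedding: the class name is hypothetical, and only the parameter name new_weight is borrowed from the checkpoint keys used in the loading code. It assumes the frozen rows can be held as a buffer and concatenated with the trainable rows on each forward pass, which is simple but not memory-optimal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartiallyTrainableEmbedding(nn.Module):
    """Hypothetical stand-in: frozen base rows plus a trainable block of new rows."""

    def __init__(self, base_weight: torch.Tensor, num_new: int):
        super().__init__()
        # Frozen copy of the original rows; buffers never receive gradients.
        self.register_buffer("base_weight", base_weight.detach().clone())
        # Only these rows are optimized. Zeros are a placeholder here; a real
        # run would start from a better init such as mean-of-subword.
        self.new_weight = nn.Parameter(torch.zeros(num_new, base_weight.shape[1]))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Concatenate so that lookups past the base vocab hit new_weight.
        full = torch.cat([self.base_weight, self.new_weight], dim=0)
        return F.embedding(input_ids, full)


emb = PartiallyTrainableEmbedding(torch.randn(10, 4), num_new=3)
out = emb(torch.tensor([[1, 11, 12]]))  # one base id, two new ids
out.sum().backward()
trainable = [name for name, p in emb.named_parameters() if p.requires_grad]
print(trainable)  # ['new_weight']
```

Because the base rows are a buffer rather than a parameter, an optimizer built from `emb.parameters()` only ever sees the new rows.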

Generate

prompt = "### Instruction:\nनेपालको राजधानी के हो?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=100, do_sample=True,
        temperature=0.7, top_p=0.9, repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
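
Since the decoded output contains the prompt as well as the completion, two small helpers can make the generation example easier to reuse. This is a sketch: the helper names are hypothetical, and the template is copied from the prompt string above. The optional Alpaca "Input" field does not appear in this card, so it is not handled.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the template used by the generation example."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def extract_response(decoded: str) -> str:
    """Return the text after the last response marker in the decoded output."""
    return decoded.split("### Response:\n")[-1].strip()


prompt = build_prompt("नेपालको राजधानी के हो?")
```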

Training Details

Tokenizer Extension

Base: Qwen3 tokenizer (151,669 tokens)
Added: 15,256 Nepali tokens selected from a 32K SentencePiece BPE model
Selection: tokens that the base tokenizer splits into 3+ subtokens
Extended vocab: 166,925 tokens
Result: 48.1% fewer tokens on Nepali text (6.13 → 3.18 tok/word)
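
The selection rule in the list above can be sketched as a filter over candidate tokens. The function and parameter names are hypothetical, and the toy tokenizer only stands in for the base tokenizer. How the real pipeline treats the SentencePiece word-boundary marker, or whether it applies any frequency filter, is not described here and is left out.

```python
from typing import Callable, Iterable, List


def select_new_tokens(
    candidates: Iterable[str],
    base_tokenize: Callable[[str], List[str]],
    min_pieces: int = 3,
) -> List[str]:
    """Keep candidates that the base tokenizer splits into min_pieces or more subtokens."""
    return [tok for tok in candidates if len(base_tokenize(tok)) >= min_pieces]


def toy_tokenize(text: str) -> List[str]:
    """Toy stand-in for the base tokenizer: one piece per two characters."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


selected = select_new_tokens(["ab", "abcd", "abcde", "abcdefg"], toy_tokenize)
print(selected)  # ['abcde', 'abcdefg']
```

With a real tokenizer, `base_tokenize` could be the `tokenize` method of the loaded Qwen3 tokenizer.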

The SentencePiece model was trained on a 7.49GB cleaned Nepali corpus assembled from CulturaX, CC-100, ai4bharat/sangraha, and Nepali books.

Continued Pretraining (CPT)

Method: LoRA (r=64, alpha=128) on q/k/v/o/gate/up/down projections + TrainableTokenEmbedding for new rows only
Data: ~800M chars Nepali (CulturaX) + ~200M chars English (FineWeb, a 20% mix to limit catastrophic forgetting)
Embedding init: Mean-of-subword — each new token initialized from the average embedding of its base-tokenizer decomposition
Optimizer: AdamW with split lr — 2e-4 for LoRA, 5e-4 for new embeddings
Schedule: Cosine with 3% warmup, 3000 steps
Batch: 2 per device, 8 gradient accumulation (effective batch 16)
Sequence length: 2048 (packed)
Precision: bf16
Hardware: Single NVIDIA A40 (46GB), ~17 hours
Loss: 3.86 → 1.53
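
The mean-of-subword initialization listed above can be illustrated without any framework. This is a dependency-free sketch with hypothetical names: `base_encode` stands in for the base tokenizer's encoding function and `base_embeddings` for the rows of the base embedding matrix. It assumes every new token decomposes into at least one base piece.

```python
from typing import Callable, Dict, List, Sequence


def mean_of_subword_init(
    new_tokens: Sequence[str],
    base_encode: Callable[[str], List[int]],
    base_embeddings: Sequence[Sequence[float]],
) -> Dict[str, List[float]]:
    """Initialize each new token as the mean of its base pieces' embeddings."""
    init = {}
    for token in new_tokens:
        rows = [base_embeddings[i] for i in base_encode(token)]
        dim = len(rows[0])
        # Average the piece embeddings one dimension at a time.
        init[token] = [sum(row[d] for row in rows) / len(rows) for d in range(dim)]
    return init


# Toy setup: a three-symbol base vocabulary with 2-dimensional embeddings.
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_embeddings = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
init = mean_of_subword_init(["bc"], lambda s: [toy_vocab[ch] for ch in s], toy_embeddings)
print(init)  # {'bc': [3.0, 6.0]}
```

With PyTorch tensors the inner computation reduces to `weight[piece_ids].mean(dim=0)`.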

Supervised Fine-Tuning (SFT)

Data: sharad461/wiseyak-sft-nepali — 113K Nepali instruction pairs in Alpaca format
Method: LoRA (r=64, alpha=128) on same projections, with modules_to_save for embeddings
Optimizer: AdamW, lr=5e-5, cosine schedule, 5% warmup
Steps: 1500, batch 2, grad_accum 8
Hardware: Single NVIDIA A40, ~2.75 hours
Loss: 1.50 → 1.09
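
To make the schedule and batch settings for both stages concrete, the sketch below computes warmup lengths, the effective batch, and the learning rate at a given step. The schedule shape (linear warmup, then cosine decay to zero) is an assumption that mirrors the common default; the card does not say which scheduler implementation was used. The function name is hypothetical.

```python
import math


def lr_at(step: int, peak_lr: float, total_steps: int, warmup_frac: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


cpt_warmup = round(3000 * 0.03)               # 90 steps
sft_warmup = round(1500 * 0.05)               # 75 steps
effective_batch = 2 * 8                       # 16 sequences per optimizer step
cpt_tokens_per_step = effective_batch * 2048  # 32768, if every packed sequence is full

print(lr_at(45, 2e-4, 3000, 0.03))    # halfway through CPT warmup for the LoRA group
print(lr_at(3000, 2e-4, 3000, 0.03))  # end of CPT
```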

Total Compute

~$7 on Vast.ai (~20 hours of A40 time).

Tokenizer Benchmark Context

This model was built as part of a 17-model Nepali tokenizer benchmark. The benchmark measures nepali_tax = nepali_tokens_per_word / english_tokens_per_word — how much more expensive Nepali is to process than English.

Model	Nepali Tax	Nepali tok/word
Phi-4 (worst)	5.7x	7.17
Qwen 3 (base, before extension)	4.9x	6.10
Qwen 3 + Nepali extension	—	3.18
Gemma 4 (best stock)	2.0x	2.52
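
The metric itself is a simple ratio. The sketch below assumes words are split on whitespace, which may differ from the benchmark's own definition (see the linked repository for that); the function names are hypothetical and `tokenize` stands in for any tokenizer callable.

```python
from typing import Callable, List


def tokens_per_word(texts: List[str], tokenize: Callable[[str], list]) -> float:
    """Average tokens per whitespace-delimited word over a corpus."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def nepali_tax(nepali_tpw: float, english_tpw: float) -> float:
    """How many times more tokens per word Nepali needs than English."""
    return nepali_tpw / english_tpw


# Toy check with a character-level "tokenizer": 5 characters over 2 words.
print(tokens_per_word(["ab cd"], list))  # 2.5
print(nepali_tax(6.0, 1.5))              # 4.0
```

Read in reverse, the Qwen 3 base row (6.10 tok/word at 4.9x) implies roughly 1.24 English tokens per word, up to rounding in the table.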

Full benchmark results, code, and methodology: github.com/sidskarkii/nepali-tokenizer

Repo Contents

sft-adapter/              SFT LoRA adapter + tokenizer files
cpt-checkpoint/           CPT LoRA adapter (contains new_weight embeddings)

Limitations

This is a 4B parameter model with LoRA adapters, not a full fine-tune.
Nepali instruction-following quality is limited by the SFT dataset size and diversity.
The model may produce Hindi-influenced text in some contexts.
Perplexity is not directly comparable across different tokenizers — use bits-per-character for fair cross-tokenizer comparison.
The manual embedding restoration step is required and cannot be skipped.
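
On the bits-per-character point above: the conversion from a per-token loss is short enough to show. This sketch assumes the loss is the mean cross-entropy per token in nats, as most training frameworks report it. The function name and the example numbers are hypothetical and are not measurements from this model.

```python
import math


def bits_per_character(mean_token_nll: float, n_tokens: int, n_chars: int) -> float:
    """Convert mean per-token cross-entropy (nats) into bits per character."""
    total_bits = mean_token_nll * n_tokens / math.log(2)
    return total_bits / n_chars


# Hypothetical: the same 1000-character text under two tokenizers.
bpc_a = bits_per_character(1.0, n_tokens=600, n_chars=1000)  # more tokens, lower per-token loss
bpc_b = bits_per_character(1.8, n_tokens=300, n_chars=1000)  # fewer tokens, higher per-token loss
print(bpc_b < bpc_a)  # True
```

In this made-up case the second tokenizer has the higher per-token loss yet the lower bits-per-character, which is why per-token perplexity alone can mislead when vocabularies differ.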

Citation

@misc{karki2026nepali,
  author = {Karki, Siddhant Singh},
  title = {Nepali Tokenizer Infrastructure: Benchmarking and Extending LLM Tokenizers for Nepali},
  year = {2026},
  url = {https://github.com/sidskarkii/nepali-tokenizer}
}
