# Qwen3-ASR-Nepali
Fine-tuned Qwen3-ASR-1.7B for Nepali automatic speech recognition.
Best average WER among 8 tested open-source models across 3 cross-dataset benchmarks. Trained on 157 hours of public Nepali speech data for approximately $7 of compute.
## Benchmark Results
Cross-dataset evaluation on 3 diverse Nepali speech benchmarks, 100 samples each. None of these datasets were used during training.
| Model | FLEURS | IndicVoices-R | OpenSLR-43 | Average |
|---|---|---|---|---|
| Qwen3-ASR-Nepali (ours) | 37.0% | 55.8% | 31.4% | 41.4% |
| Meta MMS-1B (npi) | 33.6% | 62.4% | 40.5% | 45.5% |
| Whisper large-v3 | 94.0% | 96.7% | 105.8% | 98.8% |
| Whisper-small-Nepali (amitpant7) | 64.5% | 77.7% | 2.3%* | 48.2% |
| wav2vec2-xlsr-300m (shniranjan) | 43.3% | 59.5% | 33.9%* | 45.5% |
| wav2vec2-nepali (anish) | 54.3% | 73.7% | 4.6%* | 44.2% |
| wav2vec2-xlsr (gagan) | 70.8% | 86.1% | 5.0%* | 54.0% |
| Qwen3-ASR-0.6B Base | 116.0% | 112.5% | 100.4% | 109.6% |
Models marked with * show anomalously low OpenSLR-43 WER despite high WER on other datasets, suggesting dataset-specific overfitting or training-set overlap.
Datasets:
- FLEURS – clean read speech (Google)
- IndicVoices-R – spontaneous conversational speech, 2,060 speakers (AI4Bharat, NeurIPS 2024)
- OpenSLR-43 – TTS-generated synthetic speech
## Key Results
- #1 on spontaneous speech – beats MMS-1B by 6.6 points on IndicVoices-R
- #1 on synthetic speech – beats MMS-1B by 9.1 points on OpenSLR-43
- #2 on clean read speech – 3.4 points behind MMS-1B on FLEURS
- Best macro-average WER (41.4%) among all tested models
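The WER figures above are word-level edit distance divided by reference length, which is why scores above 100% are possible when a model inserts more words than the reference contains. A minimal reference implementation (a sketch, not the evaluation script used for these benchmarks):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# An empty hypothesis scores exactly 100% WER -- the signature of the
# silent-failure bug described in the Whisper section below.
print(wer("ramro kaam garyo", ""))  # 1.0
```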
## How to Use
```python
from qwen_asr import Qwen3ASRModel
import torch

model = Qwen3ASRModel.from_pretrained(
    "sidskarki/Qwen3-ASR-Nepali",
    dtype=torch.float16,
    device_map="cuda",
)

result = model.transcribe("audio.wav")
print(result[0].text)
```
## Training Details
| Setting | Value |
|---|---|
| Base model | Qwen3-ASR-1.7B |
| Training data | OpenSLR-54 (157 h Nepali read speech, ~37K utterances) |
| Hardware | Single A100 80GB GPU |
| Best checkpoint | Step 2000 |
| Batch size | 16 (effective 128 via gradient accumulation) |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Compute cost | ~$7 |
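The effective batch size of 128 comes from a per-device batch of 16 accumulated over 8 micro-batches before each optimizer update. A minimal PyTorch sketch of that pattern, with a toy model standing in for the ASR network (illustrative only, not the actual training script):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the fine-tuned ASR model
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
accum_steps = 128 // 16         # 8 micro-batches of 16 = effective batch 128

opt.zero_grad()
for micro_batch in range(accum_steps):
    x, y = torch.randn(16, 10), torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scale each micro-batch loss so gradients average over all 128 samples.
    (loss / accum_steps).backward()
opt.step()  # one optimizer update per effective batch
```

Scaling the loss by `1 / accum_steps` keeps the accumulated gradient equivalent to a single large-batch step, so the learning rate does not need re-tuning.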
## Whisper Evaluation Bug
During benchmarking, we discovered that prior Whisper evaluations for Nepali were skewed by a float16 dtype bug: loading Whisper in float16 while the processor emits float32 features causes a dtype mismatch, and benchmark scripts that wrap inference in a bare `except: pass` silently return empty predictions, which score as 100% WER. After fixing this (loading the model in float32), Whisper large-v3 does produce Nepali text, but its WER remains high due to word-boundary and spelling errors.
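The failure mode is easy to reproduce in isolation: any float16 module fed float32 features raises a dtype error, and a bare `except` converts that into an empty transcript. A toy sketch (a plain linear layer stands in for Whisper, a random tensor for the processor output):

```python
import torch

layer = torch.nn.Linear(4, 4).half()  # model weights loaded in float16
features = torch.randn(1, 4)          # processor output stays float32

# The mismatched matmul raises, and a bare `except: pass` in the benchmark
# loop turns that into an empty prediction -- scored as 100% WER.
try:
    prediction = layer(features)
except RuntimeError:
    prediction = ""  # silently empty

# Fix: keep the model in float32 so it matches the processor output.
out = layer.float()(features)
```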
## Links
- Code: github.com/sidskarkii/nepali-asr
- Case study: siddhantskarki.com/case-studies/nepali-asr
- Portfolio: siddhantskarki.com
## Citation
```bibtex
@misc{karki2026nepali_asr,
  author = {Karki, Siddhant Singh},
  title  = {Nepali ASR: Fine-tuning Qwen3-ASR with Cross-Dataset Evaluation},
  year   = {2026},
  url    = {https://github.com/sidskarkii/nepali-asr}
}
```