# Qwen3-ASR-Nepali
Fine-tuned Qwen3-ASR-1.7B for Nepali automatic speech recognition.
Best average WER among 8 tested open-source models across 3 cross-dataset benchmarks. Trained on 157 hours of public Nepali speech data for approximately $7 of compute.
## Benchmark Results
Cross-dataset evaluation on 3 diverse Nepali speech benchmarks, 100 samples each. None of these datasets were used during training.
| Model | FLEURS | IndicVoices-R | OpenSLR-43 | Average |
|---|---|---|---|---|
| Qwen3-ASR-Nepali (ours) | 37.0% | 55.8% | 31.4% | 41.4% |
| Meta MMS-1B (npi) | 33.6% | 62.4% | 40.5% | 45.5% |
| Whisper large-v3 | 94.0% | 96.7% | 105.8% | 98.8% |
| Whisper-small-Nepali (amitpant7) | 64.5% | 77.7% | 2.3%* | 48.2% |
| wav2vec2-xlsr-300m (shniranjan) | 43.3% | 59.5% | 33.9%* | 45.5% |
| wav2vec2-nepali (anish) | 54.3% | 73.7% | 4.6%* | 44.2% |
| wav2vec2-xlsr (gagan) | 70.8% | 86.1% | 5.0%* | 54.0% |
| Qwen3-ASR-0.6B Base | 116.0% | 112.5% | 100.4% | 109.6% |
Models marked with * show anomalously low OpenSLR-43 WER despite high WER on other datasets, suggesting dataset-specific overfitting or training-set overlap.
Datasets:
- FLEURS – clean read speech (Google)
- IndicVoices-R – spontaneous conversational speech, 2,060 speakers (AI4Bharat, NeurIPS 2024)
- OpenSLR-43 – TTS-generated synthetic speech
## Key Results
- #1 on spontaneous speech – beats MMS-1B by 6.6 points on IndicVoices-R
- #1 on synthetic speech – beats MMS-1B by 9.1 points on OpenSLR-43
- #2 on clean read speech – 3.4 points behind MMS-1B on FLEURS
- Best macro-average WER (41.4%) among all tested models
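The WER figures above are word-level edit distance divided by reference length, which is why scores above 100% are possible when a model inserts more words than the reference contains. A minimal reference implementation (a sketch, not the evaluation script used for these benchmarks):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# An empty hypothesis scores exactly 100% WER -- the signature of the
# silent-failure bug described in the Whisper section below.
print(wer("ramro kaam garyo", ""))  # 1.0
```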
## How to Use
```python
from qwen_asr import Qwen3ASRModel
import torch

model = Qwen3ASRModel.from_pretrained(
    "sidskarki/Qwen3-ASR-Nepali",
    dtype=torch.float16,
    device_map="cuda",
)

result = model.transcribe("audio.wav")
print(result[0].text)
```
## Training Details
| Setting | Value |
|---|---|
| Base model | Qwen3-ASR-1.7B |
| Training data | OpenSLR-54 (157 h Nepali read speech, ~37K utterances) |
| Hardware | Single A100 80GB GPU |
| Best checkpoint | Step 2000 |
| Batch size | 16 (effective 128 via gradient accumulation) |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Compute cost | ~$7 |
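The effective batch size of 128 comes from a per-device batch of 16 accumulated over 8 micro-batches before each optimizer update. A minimal PyTorch sketch of that pattern, with a toy model standing in for the ASR network (illustrative only, not the actual training script):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the fine-tuned ASR model
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
accum_steps = 128 // 16         # 8 micro-batches of 16 = effective batch 128

opt.zero_grad()
for micro_batch in range(accum_steps):
    x, y = torch.randn(16, 10), torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scale each micro-batch loss so gradients average over all 128 samples.
    (loss / accum_steps).backward()
opt.step()  # one optimizer update per effective batch
```

Scaling the loss by `1 / accum_steps` keeps the accumulated gradient equivalent to a single large-batch step, so the learning rate does not need re-tuning.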
## Whisper Evaluation Bug
During benchmarking, we discovered that prior Whisper evaluations for Nepali were skewed by a float16 dtype bug: loading Whisper in float16 while the processor emits float32 features causes a dtype mismatch, and benchmark scripts that wrap inference in a bare `except: pass` silently return empty predictions, which score as 100% WER. After fixing this (loading the model in float32), Whisper large-v3 does produce Nepali text, but its WER remains high due to word-boundary and spelling errors.
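The failure mode is easy to reproduce in isolation: any float16 module fed float32 features raises a dtype error, and a bare `except` converts that into an empty transcript. A toy sketch (a plain linear layer stands in for Whisper, a random tensor for the processor output):

```python
import torch

layer = torch.nn.Linear(4, 4).half()  # model weights loaded in float16
features = torch.randn(1, 4)          # processor output stays float32

# The mismatched matmul raises, and a bare `except: pass` in the benchmark
# loop turns that into an empty prediction -- scored as 100% WER.
try:
    prediction = layer(features)
except RuntimeError:
    prediction = ""  # silently empty

# Fix: keep the model in float32 so it matches the processor output.
out = layer.float()(features)
```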
## Links
- Code: github.com/sidskarkii/nepali-asr
- Case study: siddhantskarki.com/case-studies/nepali-asr
- Portfolio: siddhantskarki.com
## Citation
```bibtex
@misc{karki2026nepali_asr,
  author = {Karki, Siddhant Singh},
  title  = {Nepali ASR: Fine-tuning Qwen3-ASR with Cross-Dataset Evaluation},
  year   = {2026},
  url    = {https://github.com/sidskarkii/nepali-asr}
}
```