Qwen3-ASR-Nepali

Fine-tuned Qwen3-ASR-1.7B for Nepali automatic speech recognition.

Lowest average WER among the 8 open-source models we tested across 3 cross-dataset benchmarks. Trained on 157 hours of public Nepali speech for approximately $7 of compute.

Benchmark Results

Cross-dataset evaluation on 3 diverse Nepali speech benchmarks, 100 samples each. None of these datasets were used during training.

Model                             FLEURS    IndicVoices-R    OpenSLR-43    Average
Qwen3-ASR-Nepali (ours)            37.0%        55.8%           31.4%       41.4%
Meta MMS-1B (npi)                  33.6%        62.4%           40.5%       45.5%
Whisper large-v3                   94.0%        96.7%          105.8%       98.8%
Whisper-small-Nepali (amitpant7)   64.5%        77.7%            2.3%*      48.2%
wav2vec2-xlsr-300m (shniranjan)    43.3%        59.5%           33.9%*      45.5%
wav2vec2-nepali (anish)            54.3%        73.7%            4.6%*      44.2%
wav2vec2-xlsr (gagan)              70.8%        86.1%            5.0%*      54.0%
Qwen3-ASR-0.6B Base               116.0%       112.5%          100.4%      109.6%

Models marked with * show anomalously low OpenSLR-43 WER despite high WER on other datasets, suggesting dataset-specific overfitting or training-set overlap.
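Word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal self-contained sketch of the standard definition (not the exact evaluation script used for the table); note that insertions can push WER above 100%, which is why some rows exceed it:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("नमस्ते संसार कस्तो छ", "नमस्ते संसार छ"))  # one deletion out of four words -> 0.25
```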

Datasets:

  • FLEURS - Clean read speech (Google)
  • IndicVoices-R - Spontaneous conversational speech, 2060 speakers (AI4Bharat, NeurIPS 2024)
  • OpenSLR-43 - TTS-generated synthetic speech

Key Results

  • #1 on spontaneous speech - beats MMS-1B by 6.6 points on IndicVoices-R
  • #1 on synthetic speech - beats MMS-1B by 9.1 points on OpenSLR-43
  • #2 on clean read speech - 3.4 points behind MMS-1B on FLEURS
  • Best macro-average WER (41.4%) among all tested models
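"Macro-average" here is the unweighted mean of the three per-dataset WERs (each benchmark has 100 samples, so no weighting is needed). Reproducing the Average column for the top two rows:

```python
# Unweighted (macro) average over the three 100-sample benchmarks.
scores = {
    "Qwen3-ASR-Nepali": (37.0, 55.8, 31.4),
    "Meta MMS-1B":      (33.6, 62.4, 40.5),
}
for model, wers in scores.items():
    print(f"{model}: {sum(wers) / len(wers):.1f}%")
# Qwen3-ASR-Nepali: 41.4%
# Meta MMS-1B: 45.5%
```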

How to Use

from qwen_asr import Qwen3ASRModel
import torch

# Load the fine-tuned checkpoint in half precision on a single GPU.
model = Qwen3ASRModel.from_pretrained(
    "sidskarki/Qwen3-ASR-Nepali",
    dtype=torch.float16,
    device_map="cuda"
)

# transcribe() returns a list of results; .text holds the Nepali transcript.
result = model.transcribe("audio.wav")
print(result[0].text)
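Most open-source ASR checkpoints expect mono 16 kHz input, and a mismatched sample rate degrades transcripts silently. A stdlib-only sketch that validates a WAV file before transcription (whether this particular checkpoint strictly requires 16 kHz is an assumption; the helper name is ours):

```python
import wave

def check_asr_input(path: str, expected_rate: int = 16_000) -> None:
    """Raise if a WAV file is not mono at expected_rate, the format most ASR models expect."""
    with wave.open(path, "rb") as wav:
        rate, channels = wav.getframerate(), wav.getnchannels()
    if rate != expected_rate or channels != 1:
        raise ValueError(
            f"expected mono {expected_rate} Hz, got {channels}ch {rate} Hz; "
            "resample first (e.g. ffmpeg -i in.wav -ar 16000 -ac 1 out.wav)"
        )
```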

Training Details

Base model Qwen3-ASR-1.7B
Training data OpenSLR-54 (157h Nepali read speech, ~37K utterances)
Hardware A100 80GB, single GPU
Best checkpoint Step 2000
Batch size 16 (effective 128 with gradient accumulation)
Learning rate 2e-5, 3 epochs
Compute cost ~$7
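A per-step batch of 16 with an effective batch of 128 implies 8 gradient-accumulation steps. A framework-agnostic sketch of that pattern with plain floats (an illustration of the arithmetic, not the author's actual training loop):

```python
MICRO_BATCH = 16       # per-step batch size from the table
EFFECTIVE_BATCH = 128  # effective batch size from the table
ACCUM_STEPS = EFFECTIVE_BATCH // MICRO_BATCH  # 8 micro-batches per optimizer step

grad, updates = 0.0, 0
for step, micro_grad in enumerate([0.5] * 32, start=1):  # 32 micro-batches of fake gradients
    grad += micro_grad / ACCUM_STEPS  # average gradients across the accumulation window
    if step % ACCUM_STEPS == 0:
        updates += 1   # one optimizer step per ACCUM_STEPS micro-batches
        grad = 0.0
print(ACCUM_STEPS, updates)  # 8 4
```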

Whisper Evaluation Bug

During benchmarking, we discovered that prior Whisper evaluations for Nepali were affected by a float16 dtype bug: loading Whisper in float16 causes a dtype mismatch with the processor's float32 output, and benchmark scripts that wrap inference in a bare except: pass silently produce empty predictions, which score as 100% WER. After fixing this (loading in float32), Whisper large-v3 does produce Nepali text, but still with high WER due to word-boundary and spelling issues.
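The failure mode can be reproduced in miniature: when inference raises and a bare except swallows the error, every reference word scores as a deletion against the empty hypothesis, so the harness reports exactly 100% WER and the bug looks like a bad model. The transcribe function below is a stand-in stub, not the actual Whisper call:

```python
def flaky_transcribe(audio_path):
    # Stand-in for a Whisper call where the model runs in float16 but the
    # processor hands it float32 features, triggering a runtime dtype error.
    raise RuntimeError("Input type (float) and bias type (c10::Half) should be the same")

predictions = []
for clip in ["a.wav", "b.wav"]:
    try:
        predictions.append(flaky_transcribe(clip))
    except:                      # bare except: the error is swallowed silently
        predictions.append("")   # empty hypothesis enters the scoring step

# Against an empty hypothesis, every reference word is a deletion:
reference = "नमस्ते संसार"
print(len(reference.split()) / len(reference.split()))  # WER = 1.0, i.e. 100%
```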

Citation

@misc{karki2026nepali_asr,
  author = {Karki, Siddhant Singh},
  title = {Nepali ASR: Fine-tuning Qwen3-ASR with Cross-Dataset Evaluation},
  year = {2026},
  url = {https://github.com/sidskarkii/nepali-asr}
}