whispLM-600m

whispLM-600m is a Speech-LLM for low-resource Urdu automatic speech recognition (ASR). It couples a Whisper-tiny acoustic encoder to a Qwen3-0.6B language-model decoder through a learned compression projector, and is trained with a two-stage curriculum.

Architecture

[Figure: WhispLM architecture]
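
To make the idea of a compression projector between the encoder and the decoder more concrete, here is a minimal sketch. It is not the projector shipped in projector.pt: the class name `CompressionProjector`, the frame-stacking factor `k=4`, and the two-layer MLP are assumptions chosen for illustration. The widths 384 and 1024 are the hidden sizes of Whisper-tiny and Qwen3-0.6B.

```python
import torch
import torch.nn as nn


class CompressionProjector(nn.Module):
    """Hypothetical projector: stacks k adjacent encoder frames, then maps them to the LLM width."""

    def __init__(self, enc_dim: int = 384, llm_dim: int = 1024, k: int = 4):
        super().__init__()
        self.k = k  # compression factor along the time axis (assumed value)
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * k, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, enc_dim)
        b, t, d = x.shape
        t = t - t % self.k  # drop trailing frames that do not fill a full group
        x = x[:, :t].reshape(b, t // self.k, d * self.k)
        return self.proj(x)  # (batch, frames // k, llm_dim)


# Whisper encoders emit 1500 frames for a padded 30-second input.
frames = torch.randn(1, 1500, 384)
speech_embeds = CompressionProjector()(frames)
print(speech_embeds.shape)  # torch.Size([1, 375, 1024])
```

Shortening the sequence before it reaches the decoder keeps the number of speech tokens the LLM attends to small. The actual compression ratio and layer layout of whispLM-600m may differ.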

Results

Evaluated on 1,000 test samples. Normalization: diacritics stripped, NFKC Unicode normalization applied.

Model	WER ↓	CER ↓	RTF ↓
whispLM-600m	0.5566	0.2805	0.1953
voxbridge-37m (baseline)	0.6076	0.2455	0.0268
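
The normalization and error rates described above could be computed along the following lines. This is a sketch, not the evaluation script used for the table: it assumes that "diacritics" means Unicode nonspacing marks (category Mn), that NFKC is applied before stripping, and that scores are computed per utterance. The helper names `normalize`, `edit_distance`, `wer`, and `cer` are illustrative.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, drop nonspacing marks (assumed to cover diacritics), collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one utterance after normalization."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of one utterance after normalization."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Corpus-level scores are usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.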

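whispLM-600m lowers WER by about 5.1 absolute points (0.6076 to 0.5566, roughly 8.4% relative) compared with the fine-tuned Whisper-tiny baseline. This comes at the cost of a higher CER (0.2805 vs. 0.2455) and a roughly 7x higher RTF (0.1953 vs. 0.0268).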
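
The comparison can be recomputed directly from the table. The short sketch below separates the absolute WER reduction (a difference) from the relative one (the difference divided by the baseline WER), and also reports the RTF ratio. The variable names are illustrative.

```python
# Values copied from the results table.
wer_model, wer_base = 0.5566, 0.6076
rtf_model, rtf_base = 0.1953, 0.0268

absolute_reduction = wer_base - wer_model            # difference in WER
relative_reduction = absolute_reduction / wer_base   # fraction of the baseline WER
rtf_ratio = rtf_model / rtf_base                     # compute time per second of audio, relative to baseline

print(f"absolute WER reduction: {absolute_reduction * 100:.2f} points")
print(f"relative WER reduction: {relative_reduction * 100:.2f} %")
print(f"RTF ratio: {rtf_ratio:.2f}x")
```

With the table values this gives roughly 5.1 points absolute and 8.4% relative.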

Usage

import torch
import torchaudio
from huggingface_hub import hf_hub_download  # used below to fetch projector.pt
from transformers import WhisperModel, WhisperProcessor, AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16

processor = WhisperProcessor.from_pretrained("mahwizzzz/whispLM-600m", subfolder="encoder")
encoder = WhisperModel.from_pretrained(
    "mahwizzzz/whispLM-600m", subfolder="encoder", torch_dtype=dtype
).encoder.to(device).eval()

tokenizer = AutoTokenizer.from_pretrained("mahwizzzz/whispLM-600m", subfolder="llm")
llm = AutoModelForCausalLM.from_pretrained(
    "mahwizzzz/whispLM-600m", subfolder="llm", torch_dtype=dtype
).to(device).eval()

projector_state = torch.load(
    hf_hub_download("mahwizzzz/whispLM-600m", "projector.pt"), map_location=device
)

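# Load the waveform as (channels, samples) and resample to Whisper's 16 kHz rate.
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
# Downmix to mono, then compute log-mel input features for the encoder.
feats = processor.feature_extractor(
    audio.mean(dim=0).numpy(), sampling_rate=16000, return_tensors="pt"
).input_features.to(dtype).to(device)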

Note: whispLM-600m is a custom Speech-LLM with separate encoder/, llm/, and projector.pt components. It is not compatible with the default transformers.pipeline("automatic-speech-recognition") API.

Citation

@misc{mahwiz2026whisplm,
  title={whispLM: Two-Stage Speech-LLM for Low-Resource Urdu ASR},
  author={Mahwiz Khalil},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/mahwizzzz/whispLM-600m}
}

Evaluation results

Self-reported on the Common Voice Urdu Expanded test set:
WER: 0.557
CER: 0.281