Qwen3-ASR-1.7B: Somali (merged)
Somali ASR model: Qwen/Qwen3-ASR-1.7B fine-tuned (LoRA, merged) for Somali. The base model does
not know Somali (emits Arabic-script output); this model transcribes Somali in Latin script and is
vLLM-servable (unlike Wav2Vec2-CTC models such as MMS). LoRA adapter: chrullis/Qwen3-ASR-1.7B-Somali-LoRA.
Results (FLEURS Somali test + held-out sets; WER / CER)
| Model | FLEURS (clean) | DDD-Kenya (held-out, diverse) | afri-voices (held-out, naturalistic) | vLLM-servable |
|---|---|---|---|---|
| base Qwen3-ASR-1.7B | 1.008 (Arabic script) | - | - | yes |
| MMS-1b-all | 0.453 | - | - | no |
| This model | 0.552 / 0.189 | 0.385 / 0.129 | 1.05–1.14 / 0.59–0.70 | yes |
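On held-out, diverse, real-domain Somali (DDD-Kenya) it reaches 0.385 WER, lower than the 0.453 WER that MMS scores on FLEURS. That points to genuine generalization rather than memorization, but it is not a like-for-like comparison: the table has no MMS result on DDD-Kenya, and on FLEURS itself MMS is ahead (0.453 vs 0.552 WER). CER ≪ WER because Somali Latin spelling varies: a word that is off by one character counts as a whole word error but only a single character error.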
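The gap between WER and CER can be made concrete with a small edit-distance sketch. It is an illustration only, not the scoring script behind the table; the helper names and the example strings are assumptions chosen for the demonstration and are not drawn from any eval set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


# Illustrative strings: the hypothesis drops one trailing vowel.
ref = "waan ku salaamayaa"
hyp = "waan ku salaamaya"
print(f"WER={wer(ref, hyp):.3f}  CER={cer(ref, hyp):.3f}")
```

Because insertions count as errors, WER is not capped at 1.0, which is how values such as 1.008 or 1.05–1.14 in the table can arise.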
Honest limitations
- In-distribution-dependent, not robust. Strong on clean/read & DDD-domain Somali; degrades and can loop (repetition collapse) on harder/naturalistic audio (~40% of afri-voices clips).
- Best-case CER ~0.13–0.19: useful, not fluent. Small eval sets (n=20–120).
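Since the model can loop on hard audio, it may help to flag suspicious transcripts before using them downstream. The following is a minimal heuristic sketch, assuming that a word n-gram repeated several times signals repetition collapse; `max_ngram_repeats`, `looks_looped`, and the default thresholds are hypothetical and would need tuning on real outputs.

```python
from collections import Counter


def max_ngram_repeats(text: str, n: int = 2) -> int:
    """Return how often the most frequent word n-gram occurs in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0
    return max(Counter(ngrams).values())


def looks_looped(text: str, n: int = 2, threshold: int = 3) -> bool:
    """Flag a transcript whose most common n-gram repeats at least `threshold` times."""
    return max_ngram_repeats(text, n) >= threshold


print(looks_looped("a b c a b c a b c a b c"))   # repeated pattern
print(looks_looped("one two three four five"))   # no repetition
```

A flagged clip can then be re-decoded, shortened, or sent for manual review; the heuristic says nothing about whether the rest of the transcript is correct.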
Recommended decode (already set in generation_config)
no_repeat_ngram_size=3, repetition_penalty=1.0. (no_repeat_ngram_size helps modestly on hard audio; a
repetition_penalty > 1 backfires into hallucination, as verified.)
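As a rough illustration of what `no_repeat_ngram_size=3` does at each decoding step: any token that would complete a 3-gram already present in the output is excluded from the candidates. The sketch below shows the idea in plain Python; it is not the code path the library runs, `banned_next_tokens` is a hypothetical helper, and the token ids are made up.

```python
def banned_next_tokens(generated: list[int], n: int = 3) -> set[int]:
    """Tokens that would repeat an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()  # no complete n-gram has been produced yet
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# The output so far ends in (5, 7), and (5, 7, 9) was already generated,
# so emitting 9 next would repeat a 3-gram.
print(banned_next_tokens([5, 7, 9, 5, 7], n=3))
```

By contrast, `repetition_penalty=1.0` is the neutral value: it leaves the scores of previously generated tokens unchanged.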
Usage
from qwen_asr import Qwen3ASRModel
m = Qwen3ASRModel.from_pretrained("chrullis/Qwen3-ASR-1.7B-Somali", dtype="bfloat16", device_map="cuda")
print(m.transcribe(audio="clip.wav", language="so")[0].text)
Training data
FLEURS so_so + shunyalabs/somali-speech-dataset + badrex/afri-voices-somali-speech +
DDD-Kenya/Somali-ASR-Subset-68H (4 shards). ~5,900 clips, LoRA r=16 all-linear, 3,500 steps.
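To give a sense of how small a rank-16 adapter is relative to the layer it modifies: a LoRA update on a linear layer of shape d_out x d_in adds r * (d_in + d_out) trainable weights. The arithmetic below is a sketch; the 2048 x 2048 layer size is a hypothetical value, not read from the Qwen3-ASR configuration.

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Trainable weights added by one LoRA adapter (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Weights in the frozen base linear layer (bias ignored)."""
    return d_in * d_out


d_in = d_out = 2048  # hypothetical layer size, for illustration only
added = lora_params(d_in, d_out, r=16)
ratio = added / full_params(d_in, d_out)
print(f"adapter weights: {added:,} ({ratio:.2%} of the base layer)")
```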
Use it
Install the Qwen3-ASR library:
pip install -U qwen-asr
Transcribe (merged model, recommended):
from qwen_asr import Qwen3ASRModel
model = Qwen3ASRModel.from_pretrained(
    "chrullis/Qwen3-ASR-1.7B-Somali",
    dtype="bfloat16", device_map="cuda:0", max_new_tokens=128,
)
result = model.transcribe(audio="clip.wav", language="so") # path, URL, or (np.ndarray, sr)
print(result[0].text)
Using the LoRA adapter directly
Most users should just use the merged model above. Use the LoRA only if you want to merge it yourself (or stack it onto another checkpoint). Full, runnable merge:
import torch, types
import torch.nn as nn
from peft import PeftModel
from qwen_asr import Qwen3ASRModel
from qwen_asr.core.transformers_backend.modeling_qwen3_asr import Qwen3ASRForConditionalGeneration
from qwen_asr.core.transformers_backend.processing_qwen3_asr import Qwen3ASRProcessor
BASE = "Qwen/Qwen3-ASR-1.7B"
ADAPTER = "chrullis/Qwen3-ASR-1.7B-Somali-LoRA"
# 1) load base, apply + merge the LoRA into the decoder ("thinker")
top = Qwen3ASRForConditionalGeneration.from_pretrained(BASE, dtype=torch.bfloat16)
tok_emb = max((m for m in top.thinker.modules() if isinstance(m, nn.Embedding)),
              key=lambda e: e.num_embeddings)
top.thinker.get_input_embeddings = types.MethodType(lambda self: tok_emb, top.thinker) # qwen-asr quirk
top.thinker = PeftModel.from_pretrained(top.thinker, ADAPTER).merge_and_unload()
# 2) keep decoding greedy (sampling makes transcribe ramble), then save a usable model dir
gc = top.generation_config
gc.do_sample = False; gc.temperature = None; gc.top_p = None; gc.top_k = None
gc.no_repeat_ngram_size = 3
top.save_pretrained("somali_merged")
Qwen3ASRProcessor.from_pretrained(BASE).save_pretrained("somali_merged")
# 3) load the merged dir for transcription
model = Qwen3ASRModel.from_pretrained("somali_merged", dtype="bfloat16",
                                      device_map="cuda:0", max_new_tokens=128)
print(model.transcribe(audio="clip.wav", language="so")[0].text)
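Conceptually, merging folds each low-rank update into its base weight, W' = W + (alpha / r) * B A, so the merged layer gives the same output as base plus adapter with no extra modules at inference. The NumPy sketch below checks that identity on random matrices; the shapes and the `alpha` value are made up for illustration, and this is not PEFT's implementation.

```python
import numpy as np


def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, r: int) -> np.ndarray:
    """Fold a LoRA update into the base weight: W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (B @ A)


rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 4.0   # hypothetical sizes and scaling
W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in))         # LoRA down-projection
B = rng.normal(size=(d_out, r))        # LoRA up-projection
x = rng.normal(size=(d_in,))

y_adapter = W @ x + (alpha / r) * (B @ (A @ x))  # base output plus adapter path
y_merged = merge_lora(W, A, B, alpha, r) @ x     # single merged matmul
print(np.allclose(y_adapter, y_merged))
```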
Recommended decode: no_repeat_ngram_size=3, repetition_penalty=1.0 (already set in this repo's
generation_config.json). A repetition_penalty > 1 backfires into hallucination; do not raise it.
Host it (vLLM, OpenAI-compatible)
Qwen3-ASR serves under vLLM via the OpenAI /v1/audio/transcriptions endpoint:
pip install vllm # install on an ext4 filesystem; some encrypted homes break flashinfer's long filenames
vllm serve chrullis/Qwen3-ASR-1.7B-Somali \
--served-model-name qwen3-asr-somali \
--dtype float16 --gpu-memory-utilization 0.80 --max-model-len 4096
Query it (any OpenAI-compatible client):
curl -s http://localhost:8000/v1/audio/transcriptions \
-F file=@clip.wav -F model=qwen3-asr-somali -F language=so -F response_format=json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")  # placeholder key
with open("clip.wav", "rb") as f:  # close the file handle once the request finishes
    resp = client.audio.transcriptions.create(model="qwen3-asr-somali", file=f, language="so")
print(resp.text)
Audio must be 16 kHz mono (resample first if needed). The model is small (1.7B, ~4 GB fp16) and fits an 8 GB GPU.
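If a clip is not already 16 kHz mono, it can be converted before transcription. The sketch below is a deliberately simple NumPy version that averages channels and resamples by linear interpolation with no anti-aliasing filter; `to_mono_16k` is a hypothetical helper, it assumes input shaped (frames,) or (frames, channels), and a dedicated resampler is likely preferable for real audio.

```python
import numpy as np

TARGET_SR = 16_000  # sample rate the model expects


def to_mono_16k(samples, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz using linear interpolation."""
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:
        x = x.mean(axis=1)  # average channels; assumes shape (frames, channels)
    if sr == TARGET_SR or len(x) == 0:
        return x
    n_out = int(round(len(x) * TARGET_SR / sr))
    t_in = np.arange(len(x)) / sr          # timestamps of the input samples
    t_out = np.arange(n_out) / TARGET_SR   # timestamps of the output samples
    return np.interp(t_out, t_in, x).astype(np.float32)


stereo_8k = np.zeros((8000, 2))            # one second of 8 kHz stereo silence
print(to_mono_16k(stereo_8k, 8000).shape)
```

The resulting array can be paired with its sample rate, for example `(array, 16000)`, in the `(np.ndarray, sr)` form that `transcribe` accepts.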