Qwen3-ASR-1.7B — Somali (merged)

Somali ASR model: Qwen/Qwen3-ASR-1.7B fine-tuned for Somali with LoRA, with the adapter merged into the base weights. The base model does not handle Somali (it emits Arabic-script output); this model transcribes Somali in Latin script and is vLLM-servable, unlike Wav2Vec2-CTC models such as MMS. LoRA adapter: chrullis/Qwen3-ASR-1.7B-Somali-LoRA.

Results (FLEURS Somali test + held-out sets; WER / CER, lower is better; a single value is WER only)

Model	FLEURS (clean)	DDD-Kenya (held-out, diverse)	afri-voices (held-out, naturalistic)	vLLM-servable
base Qwen3-ASR-1.7B	1.008 (Arabic script)	—	—	yes
MMS-1b-all	0.453	—	—	no
This model	0.552 / 0.189	0.385 / 0.129	1.05–1.14 / 0.59–0.70	yes

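On held-out, diverse, real-domain Somali (DDD-Kenya) it reaches 0.385 WER, lower than the 0.453 WER that MMS scores on FLEURS, which points to genuine generalization rather than memorization. The two numbers come from different test sets, so this is not a like-for-like comparison: on FLEURS itself MMS is ahead (0.453 vs 0.552 WER), and MMS was not evaluated on DDD-Kenya. CER ≪ WER because Somali Latin spelling varies, so a one-letter spelling difference counts as a full word error.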
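To make the WER / CER gap concrete, here is a minimal sketch of both metrics as edit distance divided by reference length. It is not the evaluation script used for the table above (tokenization and text normalization there are unknown), and the example strings are illustrative placeholders rather than clips from any test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


# Illustrative pair: one vowel differs in one of three words.
ref = "waxaan tagay suuqa"
hyp = "waxaan tegay suuqa"
print(round(wer(ref, hyp), 3))  # 0.333 (1 of 3 words wrong)
print(round(cer(ref, hyp), 3))  # 0.056 (1 of 18 characters wrong)

# Insertions count as errors, so a looping hypothesis can push WER above 1.
print(wer("a b", "a b b b b"))  # 1.5
```

The last line shows how a WER above 1.0, as in the afri-voices column, is possible: the hypothesis contains more inserted words than the reference has words.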

Honest limitations

In-distribution-dependent, not robust. Strong on clean/read & DDD-domain Somali; degrades and can loop (repetition collapse) on harder/naturalistic audio (~40% of afri-voices clips).
Best-case CER ~0.13–0.19 — useful, not fluent. Small eval sets (n=20–120).
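Since looping output is a known failure mode, it may be worth flagging suspicious transcripts before using them. The sketch below uses a compression-ratio heuristic of the kind Whisper applies during decoding: highly repetitive text compresses unusually well. The function names are hypothetical, and the 2.4 threshold is an assumption borrowed from Whisper's default, not tuned for this model.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length; high means repetitive."""
    data = text.encode("utf-8")
    if not data:
        return 0.0
    return len(data) / len(zlib.compress(data))


def looks_looped(text: str, threshold: float = 2.4) -> bool:
    """Flag a transcript as a probable repetition loop (threshold is an assumption)."""
    return compression_ratio(text) > threshold


# Placeholder strings, not real model output.
print(looks_looped("waa la yaab " * 40))                             # True
print(looks_looped("waxaan tagay suuqa si aan u soo iibsado cunto"))  # False
```

Flagged clips could be re-decoded, truncated, or sent for manual review; which of these is appropriate depends on the application.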

Recommended decode (already set in generation_config)

no_repeat_ngram_size=3, repetition_penalty=1.0. (no_repeat helps modestly on hard audio; a repetition_penalty > 1 backfires into hallucination — verified.)
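For intuition about what no_repeat_ngram_size=3 does, the following simplified sketch computes which next tokens would be blocked at a decoding step. It illustrates the idea only and is not the Transformers implementation; the function name is hypothetical, and the integers stand in for token IDs (the constraint applies to tokens, not words).

```python
def banned_next_tokens(prefix, n=3):
    """Return tokens that would complete an n-gram already present in `prefix`."""
    if n < 2:
        raise ValueError("this sketch only handles n >= 2")
    context = tuple(prefix[-(n - 1):])  # the last n-1 generated tokens
    banned = set()
    # Scan every earlier n-gram; if its first n-1 tokens match the current
    # context, its final token would create a repeat and is therefore banned.
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned


# After generating 5 7 9 5 7, emitting 9 would repeat the trigram (5, 7, 9).
print(banned_next_tokens([5, 7, 9, 5, 7]))  # {9}
print(banned_next_tokens([1, 2, 3, 4]))     # set()
```

During generation, banned tokens are excluded from selection, so an exact trigram loop cannot continue. Near-repeats with small variations are not covered, which fits the observation that the setting helps only modestly on hard audio.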

Usage

from qwen_asr import Qwen3ASRModel
m = Qwen3ASRModel.from_pretrained("chrullis/Qwen3-ASR-1.7B-Somali", dtype="bfloat16", device_map="cuda")
print(m.transcribe(audio="clip.wav", language="so")[0].text)

Training data

FLEURS so_so + shunyalabs/somali-speech-dataset + badrex/afri-voices-somali-speech + DDD-Kenya/Somali-ASR-Subset-68H (4 shards). ~5,900 clips, LoRA r=16 all-linear, 3,500 steps.

Use it

Install the Qwen3-ASR library:

pip install -U qwen-asr

Transcribe (merged model — recommended):

from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "chrullis/Qwen3-ASR-1.7B-Somali",
    dtype="bfloat16", device_map="cuda:0", max_new_tokens=128,
)
result = model.transcribe(audio="clip.wav", language="so")  # path, URL, or (np.ndarray, sr)
print(result[0].text)

Using the LoRA adapter directly

Most users should just use the merged model above. Use the LoRA only if you want to merge it yourself (or stack it onto another checkpoint). Full, runnable merge:

import torch, types
import torch.nn as nn
from peft import PeftModel
from qwen_asr import Qwen3ASRModel
from qwen_asr.core.transformers_backend.modeling_qwen3_asr import Qwen3ASRForConditionalGeneration
from qwen_asr.core.transformers_backend.processing_qwen3_asr import Qwen3ASRProcessor

BASE = "Qwen/Qwen3-ASR-1.7B"
ADAPTER = "chrullis/Qwen3-ASR-1.7B-Somali-LoRA"

# 1) load base, apply + merge the LoRA into the decoder ("thinker")
top = Qwen3ASRForConditionalGeneration.from_pretrained(BASE, dtype=torch.bfloat16)
tok_emb = max((m for m in top.thinker.modules() if isinstance(m, nn.Embedding)),
              key=lambda e: e.num_embeddings)
top.thinker.get_input_embeddings = types.MethodType(lambda self: tok_emb, top.thinker)  # qwen-asr quirk
top.thinker = PeftModel.from_pretrained(top.thinker, ADAPTER).merge_and_unload()

# 2) keep decoding greedy (sampling makes transcribe ramble), then save a usable model dir
gen_cfg = top.generation_config
gen_cfg.do_sample = False
gen_cfg.temperature = None
gen_cfg.top_p = None
gen_cfg.top_k = None
gen_cfg.no_repeat_ngram_size = 3
top.save_pretrained("somali_merged")
Qwen3ASRProcessor.from_pretrained(BASE).save_pretrained("somali_merged")

# 3) load the merged dir for transcription
model = Qwen3ASRModel.from_pretrained("somali_merged", dtype="bfloat16",
                                      device_map="cuda:0", max_new_tokens=128)
print(model.transcribe(audio="clip.wav", language="so")[0].text)

Recommended decode: no_repeat_ngram_size=3, repetition_penalty=1.0 (already set in this repo's generation_config.json). A repetition_penalty > 1 backfires into hallucination — do not raise it.

Host it (vLLM, OpenAI-compatible)

Qwen3-ASR serves under vLLM via the OpenAI /v1/audio/transcriptions endpoint:

pip install vllm   # install on an ext4 filesystem; some encrypted homes break flashinfer's long filenames
vllm serve chrullis/Qwen3-ASR-1.7B-Somali \
  --served-model-name qwen3-asr-somali \
  --dtype float16 --gpu-memory-utilization 0.80 --max-model-len 4096

Query it (any OpenAI-compatible client):

curl -s http://localhost:8000/v1/audio/transcriptions \
  -F file=@clip.wav -F model=qwen3-asr-somali -F language=so -F response_format=json

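from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
# Open the clip in a context manager so the file handle is closed after the request
with open("clip.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="qwen3-asr-somali", file=f, language="so")
print(result.text)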

Audio must be 16 kHz mono (resample first if needed). The model is small (1.7B, ~4 GB fp16) and fits an 8 GB GPU.
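A quick way to check whether a clip already meets the 16 kHz mono requirement is to read its WAV header with Python's standard-library wave module. This is a minimal sketch with hypothetical helper names; it works for PCM WAV files only, so other containers (MP3, FLAC, and so on) need a different reader.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels()


def is_16k_mono(path):
    """True if the file is already 16 kHz mono and needs no resampling."""
    return wav_format(path) == (16000, 1)
```

If the check fails, one common way to convert is `ffmpeg -i input.wav -ar 16000 -ac 1 clip.wav`, assuming ffmpeg is installed.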
