Qwen3-ASR-1.7B-JA — Speaker-Attributed ASR via Projector + Speaker-Embedding Fine-Tuning

Joint transcription + 2-speaker diarization for Japanese telephone conversations. The model outputs a transcript with inline speaker tags <|spk_0|> / <|spk_1|>.

Fine-tuned from neosophie/Qwen3-ASR-1.7B-JA.

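Summary (translated from Japanese): This model adds speaker diarization to Qwen3-ASR (a Speech LLM). The key point is that the LLM and the audio encoder are completely frozen, and only the projector and the embeddings of the newly added speaker tokens are trained. It is less prone to breaking on out-of-domain audio than the common approach of applying LoRA to the LLM.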

Method — the key idea

Qwen3-ASR is a Speech LLM: [audio encoder] → [projector] → [LLM decoder]. To add diarization we register new speaker tokens <|spk_0|> … <|spk_N|> and teach the model to emit them.

The central finding of this work:

Freeze the entire LLM decoder and the audio encoder. Train ONLY (a) the projector and (b) the embedding rows of the new speaker tokens.

Why this is enough. Diarization needs only two things the base model lacks:

Routing the speaker/acoustic information the encoder already captures — handled by the projector (it re-shapes audio features into the LLM space so the LLM can place speaker boundaries).
The meaning of the new speaker tokens — handled by training only their embedding rows.

Neither requires rewriting the LLM's reasoning or the encoder's acoustic modeling. So:

Component	State	Reason
Audio encoder	❄️ frozen	already encodes who is speaking
LLM decoder	❄️ frozen	preserves general ASR / Japanese ability
Projector	✅ trained	re-shapes features so the LLM can mark speaker turns
New <|spk_N|> embeddings	✅ trained	give the new speaker tokens their meaning
norms / biases / existing embeddings	❄️ frozen	—
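
The recipe in the table can be sketched in PyTorch on a toy stand-in model. This is a minimal illustration, not the training code used for this checkpoint: `ToySpeechLLM`, its layer sizes, and the token ids are hypothetical, and only the names `proj1`/`proj2` and the AdamW learning rate come from this card. PyTorch tracks `requires_grad` per tensor rather than per row, so the sketch assumes the whole embedding matrix is unfrozen and a gradient hook zeroes every pre-existing row. Weight decay is set to 0 so that the masked rows are not shrunk by the optimizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToySpeechLLM(nn.Module):
    """Hypothetical stand-in: [audio encoder] -> [projector] -> [LLM decoder]."""
    def __init__(self, vocab_size=16, n_new_tokens=2, dim=8):
        super().__init__()
        self.encoder = nn.Linear(4, dim)      # stands in for the audio encoder
        self.proj1 = nn.Linear(dim, dim)      # projector, layer 1
        self.proj2 = nn.Linear(dim, dim)      # projector, layer 2
        self.embed = nn.Embedding(vocab_size + n_new_tokens, dim)
        self.decoder = nn.Linear(dim, dim)    # stands in for the LLM decoder
        # The speaker tokens are appended after the existing vocabulary.
        self.new_token_ids = list(range(vocab_size, vocab_size + n_new_tokens))

    def forward(self, audio_feats, token_ids):
        h = self.proj2(torch.relu(self.proj1(self.encoder(audio_feats))))
        audio_summary = h.mean(dim=1, keepdim=True)
        return self.decoder(audio_summary + self.embed(token_ids))

def freeze_all_but_projector_and_new_rows(model):
    """Freeze everything, then unfreeze the projector and the new embedding rows."""
    for p in model.parameters():
        p.requires_grad_(False)
    for module in (model.proj1, model.proj2):
        for p in module.parameters():
            p.requires_grad_(True)
    emb = model.embed.weight
    emb.requires_grad_(True)
    # Row mask: 1 for new speaker-token rows, 0 for all existing rows.
    row_mask = torch.zeros(emb.shape[0], 1)
    row_mask[model.new_token_ids] = 1.0
    emb.register_hook(lambda grad: grad * row_mask)
    return [p for p in model.parameters() if p.requires_grad]

model = ToySpeechLLM()
trainable = freeze_all_but_projector_and_new_rows(model)
optimizer = torch.optim.AdamW(trainable, lr=2e-4, weight_decay=0.0)

# One dummy training step on random data, with a placeholder loss.
before = {name: p.detach().clone() for name, p in model.named_parameters()}
audio = torch.randn(2, 5, 4)
tokens = torch.tensor([[0, 16, 3], [17, 1, 2]])   # mixes old and new token ids
loss = model(audio, tokens).pow(2).mean()
loss.backward()
optimizer.step()
```

After the step, the encoder, the decoder, and the first 16 embedding rows are unchanged, while `proj1`, `proj2`, and the two new rows have moved. The hook keeps the model structure untouched at the cost of allocating optimizer state for the full embedding matrix; keeping the new rows in a separate small parameter would avoid that.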

Why not LoRA? (this approach was better)

The common recipe is LoRA on the LLM decoder. We tried it first. It did learn diarization in-domain, but on out-of-domain audio (e.g. clean wide-band / non-telephone speech) its output broke down: it fell into repetitive loops (literally counting 1, 2, 3, … 1000) and leaked control tokens, because LoRA perturbs the LLM's weights and damages its general language ability.

The projector + speaker-embedding approach (LLM frozen) does not degrade: on the same out-of-domain audio it still produces clean transcription with speaker tags, because the LLM's general ability is left untouched.

Same in-domain diarization quality, far better out-of-domain robustness — with a much smaller trainable set.
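
The two failure modes described above (counting loops and leaked control tokens) can be screened for automatically when comparing checkpoints on out-of-domain audio. The following is a rough heuristic sketch, not the evaluation used for this model: the function names, the `<|...|>` pattern for control tokens, and the `max_run` threshold are all assumptions chosen for illustration.

```python
import re

SPEAKER_TAG = re.compile(r"<\|spk_\d+\|>")
CONTROL_TOKEN = re.compile(r"<\|[^|<>]*\|>")   # assumed shape of any control token

def leaked_control_tokens(text):
    """Return control-style tokens in the output that are not speaker tags."""
    return [t for t in CONTROL_TOKEN.findall(text) if not SPEAKER_TAG.fullmatch(t)]

def longest_counting_run(text):
    """Length of the longest run of consecutive integers (n, n+1, n+2, ...)."""
    # Speaker tags are removed first so their indices are not counted as numbers.
    numbers = [int(n) for n in re.findall(r"\d+", SPEAKER_TAG.sub(" ", text))]
    best = run = 1 if numbers else 0
    for prev, cur in zip(numbers, numbers[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best

def looks_degenerate(text, max_run=20):
    """Flag a transcript that leaks control tokens or counts upward for too long."""
    return bool(leaked_control_tokens(text)) or longest_counting_run(text) >= max_run
```

A transcript that passes this check is not necessarily correct; the check only catches the specific collapse patterns named in this section.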

Usage

# pip install qwen-asr
import torch, librosa
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "okadahiroaki/Qwen3asr_diarization_projembed",
    dtype=torch.bfloat16, device_map="cuda:0",
)
audio, _ = librosa.load("call.wav", sr=16000, mono=True)
out = model.transcribe(audio=(audio.astype("float32"), 16000), language="Japanese")[0].text
print(out)
# <|spk_0|> はい、お電話ありがとうございます。 <|spk_1|> すいません、… <|spk_0|> …

This is a full model checkpoint (not a LoRA adapter) — it loads directly, no adapter merge needed.

Output format

<|spk_0|> <utterance> <|spk_1|> <utterance> <|spk_0|> <utterance> ...

Speaker tags are positional, not fixed identities: <|spk_0|> is the first detected speaker, and a tag is emitted at each change of speaker.
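
For downstream use, the tagged string can be split into (speaker, utterance) turns. The helper below is a small sketch using only the standard library; `parse_turns` is a hypothetical name, not part of `qwen_asr`, and it assumes the output follows the format shown above. Any text before the first tag is dropped.

```python
import re
from typing import List, Tuple

SPEAKER_TAG = re.compile(r"<\|spk_(\d+)\|>")

def parse_turns(text: str) -> List[Tuple[int, str]]:
    """Split '<|spk_0|> ... <|spk_1|> ...' into (speaker_index, utterance) pairs."""
    turns = []
    matches = list(SPEAKER_TAG.finditer(text))
    for i, m in enumerate(matches):
        # An utterance runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        turns.append((int(m.group(1)), text[m.end():end].strip()))
    return turns

example = "<|spk_0|> はい、お電話ありがとうございます。 <|spk_1|> すいません、"
for speaker, utterance in parse_turns(example):
    print(f"speaker {speaker}: {utterance}")
```

Because the labels are positional, the speaker index is only meaningful within a single transcript and should not be compared across calls.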

Training

Base: neosophie/Qwen3-ASR-1.7B-JA
Trainable: projector (proj1/proj2) + new <|spk_N|> embedding rows only; everything else frozen
Optimizer: AdamW, lr 2e-4, 1 epoch
Data: Japanese telephone conversation audio (2-speaker) with speaker-attributed transcripts. The training data is private (contains personal information) and is not released.

Limitations

Trained on 2-speaker telephone conversations; optimized for narrow-band telephone audio.
Audio with three or more speakers is unverified. Speaker tokens up to <|spk_7|> exist, but only the 2-speaker case has been validated.
Japanese only.
Speaker labels are positional, not persistent speaker identities.

License

Derived from neosophie/Qwen3-ASR-1.7B-JA (Apache-2.0), itself based on Qwen/Qwen3-ASR-1.7B.
