Whisper-Tiny-MLA (11 languages) — MLA-converted, 62.5% smaller decode KV-cache

The on-device-tier sibling of the WhisperMLA family: openai/whisper-tiny (39M) with its decoder self-attention converted MHA→MLA (per Whisper-MLA, arXiv:2603.00563), recovery-fine-tuned on 11 languages of the CC0 Whispered corpus (32k clips/lang).

from transformers import AutoModelForSpeechSeq2Seq
model = AutoModelForSpeechSeq2Seq.from_pretrained("burakaydinofficial/whisper-tiny-mla-cv11", trust_remote_code=True)  # transformers==4.46.x

Honest sizing note (read this first)

Conversion cost grows as the student shrinks — measured across the family: small ≈ +0.4 median WER → base ≈ +1.0 → tiny ≈ +1.9. At the tiny tier you pay ≈ +1.9 WER (median) for the 62.5% cache cut. If quality is the priority, prefer the small variant; this tier is for memory-constrained deployments where the cache cut matters most.

Results (CommonVoice-17 test, n=1500/lang; WER/CER %; cost = paired vs an identically-trained unconverted control)

Lang	this model (WER / CER)	conversion cost
en	29.1 / 15.8	+2.41 ✱
de	42.8 / 16.6	+1.95 ✱
es	29.4 / 10.7	+1.07 ✱
fr	45.4 / 20.1	+2.05 ✱
ru	42.9 / 14.2	+3.32 ✱
tr	53.7 / 17.6	+2.63 ✱
cy	86.5 / 38.0	−0.19 (ns)
ar	67.5 / 28.7	+1.58 ✱
th	65.8 / 28.1	+0.75 CER ✱
zh	99.3 / 34.1	−1.49 CER (ns)
ka	122.1 / 81.0	−0.50 (ns) — floor

Absolute quality is tiny-tier-typical (much lower than small — that is the base model, not MLA). Encoder frozen both arms; 15,000 steps; warmup+cosine; fp16.

Limitations

Same as the flagship: transformers==4.46.x + trust_remote_code required; not loadable in whisper.cpp/faster-whisper/CT2; coverage = these 11 languages (unseen scripts degrade); ka reported as the labeled model-class floor; read-speech domain.

Acoustic conditions of the evaluation

Evaluated on crowdsourced consumer-microphone recordings with real environmental noise — traffic, room reverb, variable devices — CommonVoice's native conditions, not studio audio. The numbers above already include that heterogeneity. Not yet benchmarked: far-field, telephony (8 kHz), overlapping speech; an SNR-ladder robustness section will be added when measured.