Whisper-Tiny-MLA (11 languages) — MLA-converted, 62.5% smaller decode KV-cache

The on-device-tier sibling of the WhisperMLA family: openai/whisper-tiny (39M) with its decoder self-attention converted from MHA to MLA (per Whisper-MLA, arXiv:2603.00563), then recovery-fine-tuned on 11 languages of the CC0 Whispered corpus (32k clips per language).

# Requires transformers==4.46.x; trust_remote_code=True runs the model code shipped in the repo.
from transformers import AutoModelForSpeechSeq2Seq
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "burakaydinofficial/whisper-tiny-mla-cv11", trust_remote_code=True)

Honest sizing note (read this first)

Conversion cost grows as the student shrinks. Measured across the family, the median WER cost is ≈ +0.4 for small, ≈ +1.0 for base, and ≈ +1.9 for tiny, so at this tier you pay about 1.9 WER points (median) for the 62.5% cache cut. If quality is the priority, prefer the small variant; this tier is for memory-constrained deployments where the cache cut matters most.

Results (CommonVoice-17 test, n=1500 per language; WER/CER in %; cost = paired comparison against an identically trained, unconverted control)

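| Lang | This model (WER / CER) | Conversion cost (WER unless marked CER) |
|------|------------------------|------------------------------------------|
| en | 29.1 / 15.8 | +2.41 ✱ |
| de | 42.8 / 16.6 | +1.95 ✱ |
| es | 29.4 / 10.7 | +1.07 ✱ |
| fr | 45.4 / 20.1 | +2.05 ✱ |
| ru | 42.9 / 14.2 | +3.32 ✱ |
| tr | 53.7 / 17.6 | +2.63 ✱ |
| cy | 86.5 / 38.0 | −0.19 (ns) |
| ar | 67.5 / 28.7 | +1.58 ✱ |
| th | 65.8 / 28.1 | +0.75 CER ✱ |
| zh | 99.3 / 34.1 | −1.49 CER (ns) |
| ka | 122.1 / 81.0 | −0.50 (ns), floor |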
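The median quoted in the sizing note can be checked against the per-language costs above. The card does not say which languages enter the median; the sketch below assumes it is taken over the nine languages whose cost is reported in WER, leaving out th and zh, whose cost is reported in CER.

```python
from statistics import median

# Conversion cost per language, copied from the results table.
wer_cost = {
    "en": 2.41, "de": 1.95, "es": 1.07, "fr": 2.05, "ru": 3.32,
    "tr": 2.63, "cy": -0.19, "ar": 1.58, "ka": -0.50,
}
# Reported in CER rather than WER, so excluded here (an assumption).
cer_cost = {"th": 0.75, "zh": -1.49}

median_wer_cost = median(wer_cost.values())
print(median_wer_cost)  # 1.95
```

Under that assumption the result, 1.95, is consistent with the ≈ +1.9 quoted for the tiny tier.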

Absolute quality is typical of the tiny tier and much lower than small; that gap comes from the base model, not from the MLA conversion. The encoder is frozen in both arms; training runs 15,000 steps with a warmup + cosine schedule in fp16.
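For readers unfamiliar with the metrics, here is a minimal sketch of WER and CER as edit distance divided by reference length. It is not the scoring pipeline used for this card: the text normalization and tooling behind the reported numbers are not stated, and this version applies no normalization at all. It does show why a WER above 100%, as in the ka row, is possible: inserted words count as errors, so a hypothesis much longer than the reference can accumulate more errors than the reference has words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent; words are whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, computed over raw characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat", "the cat sat"))    # 0.0
print(wer("hello", "hello there my friend"))  # 300.0
```

The second call has a one-word reference and three inserted words, which gives 300%.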

Limitations

Same as the flagship: transformers==4.46.x + trust_remote_code required; not loadable in whisper.cpp/faster-whisper/CT2; coverage = these 11 languages (unseen scripts degrade); ka reported as the labeled model-class floor; read-speech domain.

Acoustic conditions of the evaluation

Evaluated on crowdsourced consumer-microphone recordings with real environmental noise — traffic, room reverb, variable devices — CommonVoice's native conditions, not studio audio. The numbers above already include that heterogeneity. Not yet benchmarked: far-field, telephony (8 kHz), overlapping speech; an SNR-ladder robustness section will be added when measured.

Model size: 37.2M params (Safetensors, tensor type F16)