Broad-Arabic Phoneme Zipformer (streaming, encoder + CTC head)
A streaming phoneme recognizer for general Arabic, both MSA and dialect. It maps audio to a stream of phonemes (consonant + ḥaraka units) that reflect what was actually pronounced, including dialectal forms. It does not normalise speech back to MSA. This is the broad-Arabic pre-training model that the Qur'an phoneme zipformer is fine-tuned from, published on its own for phonetic transcription, pronunciation/dialect analysis, and as a pre-training checkpoint.
- Architecture: Zipformer2 (icefall/k2), causal streaming, 65.5M params, CTC over a 250-symbol phoneme inventory.
- Output: phoneme units only. No text decoder is included; this repo is intentionally phonemes-only.
- Training: about 1700 h of Arabic speech (MASC, SADA, MGB2, CommonVoice, ArVoice, FLEURS), 5 epochs, phoneme-CTC.
Phoneme labels: LLM-phonemized, source-aware, dialect-preserving
The broad-Arabic phoneme targets were produced by an LLM phonemizer with source-aware prompting (a prompt tailored to each corpus / register, since spontaneous dialect, broadcast MSA and read MSA phonemize differently). A deliberate goal, and one we spot-checked, was to retain dialectal information: the labels (and therefore the model) capture the actual dialectal pronunciation rather than collapsing it to Modern Standard Arabic. Because the labels are LLM-generated, they carry some noise (a minority of clips show vowel-length artifacts). The Qur'an fine-tune is sharper because Qur'an phoneme labels are deterministic (rule-based).
What it does (spot-check)
Predicted phonemes vs reference text (the model output is the PHON line).
Dialect (Moroccan Darija): it keeps the dialect, e.g. the negation circumfix ما…ش and dialectal lexemes.
ref متخافيش يوجع غير شوية phon مَتَخَاافِۦۦش يُوعفِۦۦ ل شُوَييَت ("ma-txafi-š ... šwayya")
ref ماتعاونينيش يزوجوني بسيف phon مَتَءَوِيَنِۦۦش يَزَوجُۥۥنِۦۦ بَسيَف (keeps the "-niš" suffix)
MSA (read): faithful phonetics. Sun-letter assimilation, gemination, and long vowels are all there.
ref يتمُّ دعم التعلُّم التفاعليّ phon يَتِمم دَاامُ ل تتَعَللُم ل تتَفَااعُلِي (gemination "taʿallum")
ref تشكلت في المحيط الأطلسي phon تَشَككَت فِۦۦ ل مَمُحِۦۦط ِل ءَطلَااصِي
It is a good, honest phonetic transcriber: the words and their pronunciation come through clearly in both registers, with occasional minor slips (a dropped or merged consonant here and there).
Good for
- Phonetic / dialect transcription: what was actually pronounced, at the phoneme level.
- Pronunciation-assessment front-end: compare predicted phonemes to a reference.
- Pre-training: it carries broad Arabic acoustic knowledge. Fine-tuning on a deterministically-labelled domain (Qur'an) yields a sharp model, which is exactly how the Qur'an model was built.
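The pronunciation-assessment use above comes down to comparing a predicted phoneme sequence with a reference one. A minimal sketch of that comparison as a phoneme error rate (edit distance divided by reference length), assuming both sequences are already lists of phoneme units. The function names and the Latin placeholder symbols are illustrative and not part of this repo:

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] is the distance between the reference prefix seen so far
    # (before the current row) and hyp[:j].
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference units to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference unit missing from hypothesis
                curr[j - 1] + 1,     # extra unit in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder symbols only; real units come from phoneme_units.json.
reference = ["t", "a", "3", "a", "l", "l", "u", "m"]
predicted = ["t", "a", "3", "a", "l", "u", "m"]  # one "l" dropped
per = phoneme_error_rate(reference, predicted)
```

Because the model keeps dialectal pronunciation, the reference should be in the same register as the speech being assessed; otherwise legitimate dialect forms are counted as errors.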
Note on text / WER
This repo contains only the encoder and its CTC head: no seq2seq, no text decoder. Converting phonemes to text is a separate, hard problem for open-vocabulary Arabic, so this model should not be judged by word-level text WER, which measures a decoder rather than this encoder. If you need Arabic text, use a text ASR model. If you need Arabic phonemes or pronunciation, with dialect preserved, use this model.
Files
| file | what |
|---|---|
| arabic_phoneme_zipformer.pt | model weights (65.5M, inference-only) plus blank_id |
| phoneme_units.json | the 250-unit phoneme tokenizer (greedy longest-match) |
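Greedy longest-match tokenization means that, at each position in a phoneme string, the longest unit from the inventory that matches is taken. A minimal sketch of that rule follows. The layout of phoneme_units.json is not described here, so the inventory is passed in as a plain list of strings, and the function name and toy inventory are hypothetical:

```python
from typing import Iterable, List


def greedy_longest_match(text: str, units: Iterable[str]) -> List[str]:
    """Split text into inventory units, always taking the longest match."""
    inventory = set(units)
    if not inventory:
        raise ValueError("unit inventory must be non-empty")
    max_len = max(len(u) for u in inventory)

    pieces: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in inventory:
                pieces.append(candidate)
                i += size
                break
        else:
            # No unit of any length matches at this position.
            raise ValueError(f"no unit matches at position {i}: {text[i]!r}")
    return pieces


# Toy inventory (assumption): the real one has 250 units.
toy_units = ["a", "b", "aa", "ab"]
tokens = greedy_longest_match("aabab", toy_units)
```

With the toy inventory, "aa" wins over "a" at the start of the string because the longer unit is tried first.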
Usage
Inference uses icefall Zipformer2 with 80-bin kaldi fbank (povey) features at 16 kHz, followed by CTC-greedy decoding to phoneme units. The inference path is the same as for the Qur'an model: see quran_per_eval.py in that repo for a runnable example (swap the checkpoint). For the 1000 ms look-ahead profile, use chunk_size=(24,), left_context_frames=(256,).
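The CTC-greedy step on its own is small: take the most likely symbol per frame, merge consecutive repeats, then drop blanks. A minimal sketch of that step only, assuming the per-frame argmax IDs have already been computed from the CTC head and that blank_id has been read from the checkpoint. Feature extraction and the model forward pass are not shown, and the function name and example IDs are illustrative:

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Collapse consecutive repeats, then remove blank symbols."""
    decoded: List[int] = []
    prev: Optional[int] = None
    for idx in frame_ids:
        # Emit only when the symbol changes and is not the blank.
        if idx != prev and idx != blank_id:
            decoded.append(idx)
        prev = idx
    return decoded


# Illustrative frame-level IDs with blank_id = 0 (an assumption for this example).
frames = [0, 3, 3, 0, 3, 5, 5, 0]
unit_ids = ctc_greedy_decode(frames, blank_id=0)
```

The blank between the two runs of 3 is what lets the same unit be emitted twice in a row, which matters for geminated consonants and long vowels. The resulting IDs would then be mapped back to phoneme units through the inventory in phoneme_units.json.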
License
Apache-2.0.
Related benchmark
This broad encoder is the pre-training base for zipformer_p-quran, which ranks #1 among streaming models on the leakage-free Quranic ASR Leaderboard (dataset: Quran-Lab/quranic-asr-benchmark). This phoneme encoder itself is for general Arabic and is not on that Qur'an board.