Broad-Arabic Phoneme Zipformer (streaming, encoder + CTC head)

A streaming phoneme recognizer for general Arabic, covering both Modern Standard Arabic (MSA) and dialects. It maps audio to a stream of phonemes (consonant + ḥaraka units) that reflect what was actually pronounced, including dialectal forms. It does not normalise speech back to MSA. This is the broad-Arabic pre-training model that the Qur'an phoneme zipformer is fine-tuned from. It is published on its own for phonetic transcription, pronunciation/dialect analysis, and use as a pre-training checkpoint.

  • Architecture: Zipformer2 (icefall/k2), causal streaming, 65.5M params, CTC over a 250-symbol phoneme inventory.
  • Output: phoneme units only. No text decoder is included; this repo is intentionally phonemes-only.
  • Training: about 1700 h of Arabic speech (MASC, SADA, MGB2, CommonVoice, ArVoice, FLEURS), 5 epochs, phoneme-CTC.

Phoneme labels: LLM-phonemized, source-aware, dialect-preserving

The broad-Arabic phoneme targets were produced by an LLM phonemizer with source-aware prompting: the prompt is tailored to each corpus / register, since spontaneous dialect, broadcast MSA and read MSA phonemize differently. A deliberate goal, and one we spot-checked, was to retain dialectal information: the labels (and therefore the model) capture the actual dialectal pronunciation rather than collapsing it to Modern Standard Arabic. Because the labels are LLM-generated, they carry some noise (a minority of clips show vowel-length artifacts). The Qur'an fine-tune is sharper because Qur'an phoneme labels are deterministic (rule-based).

What it does (spot-check)

Predicted phonemes vs reference text (the model output is the phon field).

Dialect (Moroccan Darija): it keeps the dialect, e.g. the negation circumfix ما…ش and dialectal lexemes.

ref  متخافيش يوجع غير شوية          phon  مَتَخَاافِۦۦش يُوعفِۦۦ ل شُوَييَت     ("ma-txafi-š ... šwayya")
ref  ماتعاونينيش يزوجوني بسيف        phon  مَتَءَوِيَنِۦۦش يَزَوجُۥۥنِۦۦ بَسيَف   (keeps the "-niš" suffix)

MSA (read): faithful phonetics. Sun-letter assimilation, gemination, and long vowels are all there.

ref  يتمُّ دعم التعلُّم التفاعليّ      phon  يَتِمم دَاامُ ل تتَعَللُم ل تتَفَااعُلِي   (gemination "taʿallum")
ref  تشكلت في المحيط الأطلسي         phon  تَشَككَت فِۦۦ ل مَمُحِۦۦط ِل ءَطلَااصِي

In these spot checks the model behaves as a faithful phonetic transcriber: the words and their pronunciation come through clearly in both registers, with occasional minor slips (a dropped or merged consonant here and there).

Good for

  • Phonetic / dialect transcription: what was actually pronounced, at the phoneme level.
  • Pronunciation-assessment front-end: compare predicted phonemes to a reference.
  • Pre-training: it carries broad Arabic acoustic knowledge. Fine-tuning on a deterministically-labelled domain (Qur'an) yields a sharp model, which is exactly how the Qur'an model was built.
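For the pronunciation-assessment use above, one common way to compare predicted phonemes to a reference is a phoneme error rate based on edit distance. The following is a minimal sketch, not the scoring used for this model: the function names are hypothetical, and it assumes both sequences have already been split into phoneme units.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two unit sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference units into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference unit missing in hyp
                curr[j - 1] + 1,     # insertion: extra unit in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Toy ASCII units stand in for real phoneme units.
    reference = ["t", "a", "ʕ", "a", "l", "l", "u", "m"]
    predicted = ["t", "a", "ʕ", "a", "l", "u", "m"]  # one dropped consonant
    print(phoneme_error_rate(reference, predicted))
```

A per-unit alignment (which units were substituted, inserted or dropped) can be recovered from the same table if feedback at the phoneme level is needed.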

Note on text / WER

This repo contains the encoder and its CTC head only: no seq2seq model, no text decoder. Converting phonemes to text is a separate, hard problem for open-vocabulary Arabic, so this model should not be judged by word-level text WER, which measures a decoder rather than this encoder. If you need Arabic text, use a text ASR. If you need dialect-faithful Arabic phonemes / pronunciation, use this.

Files

file                          what
arabic_phoneme_zipformer.pt   model weights (65.5M params, inference-only) plus blank_id
phoneme_units.json            the 250-unit phoneme tokenizer (greedy longest-match)
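
Greedy longest-match tokenization, as named above, can be sketched as follows. This is an illustration of the technique only: the layout of phoneme_units.json is not documented here, so the function takes a plain collection of unit strings, and the function name and toy inventory are hypothetical.

```python
from typing import Iterable, List


def longest_match_tokenize(text: str, units: Iterable[str]) -> List[str]:
    """Split text into units, always taking the longest unit that matches."""
    inventory = set(units)
    max_len = max((len(u) for u in inventory), default=0)
    tokens: List[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in inventory:
                tokens.append(piece)
                pos += size
                break
        else:
            # No unit of any length matched at this position.
            raise ValueError(f"no unit matches at position {pos}: {text[pos]!r}")
    return tokens


if __name__ == "__main__":
    toy_units = ["a", "b", "c", "ab", "abc"]  # toy inventory, not the real one
    print(longest_match_tokenize("abcab", toy_units))
```

With the real inventory, the unit strings would first be loaded from phoneme_units.json in whatever structure that file uses.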

Usage

The model is an icefall Zipformer2. It takes 80-bin Kaldi fbank features (povey) computed from 16 kHz audio, and its output is decoded with CTC greedy search into phoneme units. The inference path is the same as for the Qur'an model: see quran_per_eval.py in that repo for a runnable example (swap in this checkpoint). For the 1000 ms look-ahead profile, use chunk_size=(24,), left_context_frames=(256,).
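
The CTC-greedy step itself is small enough to sketch. The version below assumes the per-frame best class ids (the argmax over the CTC outputs) are already available as a list of integers. The function name is hypothetical, and blank_id=0 is only a placeholder default: the actual value is the blank_id stored with the checkpoint.

```python
from typing import List, Sequence


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Collapse per-frame ids: merge consecutive repeats, then drop blanks."""
    collapsed: List[int] = []
    prev = None
    for idx in frame_ids:
        # Emit an id only when it starts a new run and is not the blank.
        if idx != prev and idx != blank_id:
            collapsed.append(idx)
        prev = idx
    return collapsed


if __name__ == "__main__":
    # Toy frame sequence; 0 plays the role of the blank here.
    frames = [0, 3, 3, 0, 3, 5, 5, 0]
    print(ctc_greedy_collapse(frames))
```

Because a blank between two identical ids ends the first run, the same unit can legitimately appear twice in a row in the output. The resulting ids are then mapped to unit strings through the tokenizer inventory.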

License

Apache-2.0.

Related benchmark

This broad encoder is the pre-training base for zipformer_p-quran, which ranks #1 among streaming models on the leakage-free Quranic ASR Leaderboard (dataset: Quran-Lab/quranic-asr-benchmark). This phoneme encoder itself is for general Arabic and is not on that Qur'an board.

