Usage agreement: free, non-commercial use only

This model is shared freely for the sake of Allah. By requesting access you agree: (1) you may use, fine-tune and redistribute it and its outputs ONLY in applications that are FREE to end users; (2) you may NOT sell it, place it behind a paid subscription or paywall, monetize it with ads, or earn any revenue from an app or service that uses this model or its outputs; (3) these terms pass on to anyone you share it with. وَما أَسأَلُكُم عَلَيهِ مِن أَجرٍ إِن أَجرِيَ إِلّا عَلىٰ رَبِّ العالَمِين

🕌 Quran Phoneme Zipformer · streaming · tajwīd-aware

A 65M-parameter streaming recognizer that hears what the reciter actually said (phonemes), so it can catch recitation mistakes that a normal ASR quietly hides.

🥇 #1 streaming model on the leakage-free Quranic ASR Leaderboard · 5.83 WER, roughly half the error of the next streaming system (Tarteel, 10.99), and closing in on the best offline model (4.13).

🏆 Where it lands (official Quran-Lab board, 600 held-out clips, same scorer for everyone)

#	model	WER ⬇️	mode
1	fastconformer-quran	4.13	offline
2	➡️ this model	5.83	🟢 streaming
3	mohammed fastconformer	6.95	offline
4	nvidia fastconformer	8.14	offline
5	Tarteel (official)	10.99	streaming
6	whisper-large-v3	11.65	offline
7	fastconformer-quran	11.96	streaming

🔗 Live board: Quranic ASR Leaderboard · 📦 dataset + scorer: Quran-Lab/quranic-asr-benchmark

Best streaming model on the board, and the gap to the offline #1 is mostly the simple decoder (see Why retrieval), not the acoustics.

🎯 The point: it catches mistakes

A normal Qur'an ASR outputs text and leans on a language model that "knows" the Qur'an, so when a reciter slips it auto-corrects the output back to the expected āyah and the mistake vanishes. Great for transcription, useless for teaching.

This model has no LM and emits phonemes, so the output is what was really said. Compare it to the canonical phonemes of the intended āyah (quran_text2phoneme.json, deterministic Ḥafṣ) and the deviations pop out:

Intended (canonical):   مَالِكِ يَومِ ددِين      maaliki yawmi d-dīn
Reciter said:           مَلِكِ يَومِ ددِين       maliki  yawmi d-dīn   (madd dropped)
Model output:           مَلِكِ ...                                      (matches what was said)
Flag:                   مَالِكِ vs مَلِكِ          ⚠ madd omitted on "مالك"

What surfaces: ✗ ḥarakāt errors · ✗ madd length · ✗ wrong/missing/extra letters · ✗ tajwīd rules (idghām, ikhfāʾ, qalqala) not applied. Correction-app flow: audio → model → phonemes → align vs canonical → flag.
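
The align-and-flag step of that flow can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: it assumes both sequences are already split into phoneme units, uses `difflib.SequenceMatcher` as a stand-in aligner, and the Latin unit strings below are made up for readability rather than taken from `phoneme_units.json`.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Flag = Tuple[str, List[str], List[str]]


def flag_deviations(canonical: List[str], predicted: List[str]) -> List[Flag]:
    """Align two phoneme-unit sequences and return every non-matching span.

    Each flag is (kind, expected_units, heard_units), where kind is
    'replace', 'delete' (expected but not heard) or 'insert' (heard, not expected).
    """
    matcher = SequenceMatcher(a=canonical, b=predicted, autojunk=False)
    flags: List[Flag] = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            flags.append((kind, canonical[i1:i2], predicted[j1:j2]))
    return flags


# Illustrative units only: "maa" (with madd) recited as "ma" (madd dropped).
canonical = ["maa", "li", "ki", "ya", "w", "mi", "d", "dii", "n"]
predicted = ["ma", "li", "ki", "ya", "w", "mi", "d", "dii", "n"]
flags = flag_deviations(canonical, predicted)
print(flags)
```

A correction app would then map each flagged span back to the word it belongs to and to a tajwīd rule; that mapping is outside this sketch.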

📊 Phoneme accuracy (PER, vs deterministic Ḥafṣ gold)

source	PER
everyayah (clean)	7.69%
QUL (unseen reciter)	14.97%
tlog (real phone-mic)	13.49%
overall	11.63%

🎚️ Look-ahead barely matters (genuinely streaming-robust), overall PER on the same 600 clips:

look-ahead	chunk	PER
320 ms	8	12.06%
640 ms	16	11.74%
1000 ms	24	11.63%
offline	128	11.66%

Even aggressive 320 ms streaming is within ~0.4 pts of full context, and going past the trained 1000 ms profile does not help. Treat PER as the true accuracy signal.

🧠 Why retrieval (not free text decoding)?

The model emits phonemes, not words, so scoring word-level WER requires turning the phonemes into text, and there is no free-text decoder here. Because the Qur'an is a closed, fixed text, the faithful mapping is to match the predicted phoneme string to the nearest canonical āyah and read off its text. Retrieval is therefore not a trick to inflate the score; it is the intrinsic decode step for a phoneme model on a closed corpus (labeled (retrieval) on the board for honesty). Most of the gap to 4.13 comes from the naive retrieval picking a similar but wrong short āyah, so the acoustics are better than the WER alone suggests, and a smarter decoder is the lever to go lower.
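
The nearest-āyah lookup can be sketched as follows. The board entry uses rapidfuzz over the canonical table; this sketch swaps in the standard library's `difflib` ratio to stay dependency-free, assumes the table is a plain text-to-phoneme-string mapping, and uses invented placeholder entries rather than real rows of `quran_text2phoneme.json`.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def retrieve_ayah(predicted: str, table: Dict[str, str]) -> Tuple[str, float]:
    """Return (text, similarity) for the entry whose phonemes are closest to `predicted`."""

    def similarity(candidate: str) -> float:
        return SequenceMatcher(a=predicted, b=candidate, autojunk=False).ratio()

    best_text = max(table, key=lambda text: similarity(table[text]))
    return best_text, similarity(table[best_text])


# Toy table (hypothetical entries): text key -> canonical phoneme string.
table = {
    "ayah-A": "maaliki yawmi ddiin",
    "ayah-B": "iyyaaka naabudu",
    "ayah-C": "maliki nnaas",
}
text, score = retrieve_ayah("maliki yawmi ddiin", table)
print(text, round(score, 3))
```

Note that the slightly wrong input still retrieves the intended entry. That is what makes retrieval suitable for scoring WER, and also why mistake detection has to compare phonemes directly instead of relying on the retrieved text.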

📦 Files

file	what
`quran_phoneme_zipformer.pt`	PyTorch weights (65.5M, inference-only) + `blank_id`
`quran_phoneme_zipformer.onnx`	cache-aware streaming CTC ONNX (sherpa-onnx compatible)
`quran_phoneme_zipformer.int8.onnx`	INT8 streaming ONNX (73 MB, dynamic-quantized MatMul)
`scripts/export_quran_streaming_onnx.py`	the streaming ONNX + int8 export script
`phoneme_units.json`	250-unit phoneme tokenizer
`quran_text2phoneme.json`	canonical text→phoneme table (mistake-detection / retrieval)
`scripts/zipformer_rnnt_ctc_train.py`	`build_model` + `PhonemeUnitTokenizer` (needed to load/export)
`scripts/zipformer_rnnt_ctc_eval.py`	`greedy_ctc_decode`
`quran_per_eval.py`, `quran_wer_retrieval.py`	reproduce PER / board WER

🚀 Usage

The model is an icefall Zipformer2. Build it with build_model from the included scripts/zipformer_rnnt_ctc_train.py, load the ["model"] weights, feed 80-bin kaldi fbank (povey window) at 16 kHz, and decode CTC greedily. The 1000 ms profile corresponds to chunk_size=(24,), left_context_frames=(256,). You also need a local clone of k2-fsa/icefall (for its zipformer, scaling and subsampling modules).

import torch
from scripts.zipformer_rnnt_ctc_train import build_model, load_tokenizer

# The CTC blank takes the id right after the last phoneme unit.
tok = load_tokenizer("phoneme_units.json")
blank = tok.get_piece_size()
# 1000 ms streaming profile: chunk 24, left context 256.
model = build_model(blank + 1, blank, chunk_frames=[24], left_context_frames=256).eval()
model.load_state_dict(torch.load("quran_phoneme_zipformer.pt", map_location="cpu")["model"])
# Greedy CTC over the encoder output (feats: fbank batch, feat_lens: frame counts):
# enc, lens = model.encode(feats, feat_lens)
# ids = model.ctc_head(enc).argmax(-1)

quran_per_eval.py is a complete runnable example.
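
The per-frame arg-max ids still have to be collapsed into a unit sequence. The repository's own `greedy_ctc_decode` is the reference for that; the sketch below only shows the standard CTC rule (merge repeats, then drop blanks) and may differ from it in interface. The blank id of 250 follows from the snippet above, where the blank sits right after the 250 units; the frame ids themselves are invented.

```python
from typing import List, Optional, Sequence


def ctc_collapse(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Greedy CTC post-processing: merge consecutive repeats, then drop blanks."""
    out: List[int] = []
    prev: Optional[int] = None
    for idx in frame_ids:
        # Emit a unit only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank_id:
            out.append(idx)
        prev = idx
    return out


# Invented per-frame ids for one utterance; 250 is the blank.
frames = [250, 7, 7, 250, 7, 12, 12, 250, 250, 3]
units = ctc_collapse(frames, blank_id=250)
print(units)  # [7, 7, 12, 3]
```

The blank between the two runs of 7 is what lets the same unit appear twice in a row in the output.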

ONNX (cache-aware streaming, sherpa-onnx ready): the ONNX is a chunk-by-chunk streaming zipformer2-CTC graph. Inputs are one chunk x (1,T,80) of 80-bin kaldi fbank plus the encoder cache-state tensors (cached_key/nonlin_attn/val1/val2/conv1/conv2 per layer, embed_states, processed_lens); outputs are log_probs plus the updated new_* states you feed into the next chunk. All the streaming params (decode_chunk_len, T, left_context_len, layer dims, etc.) are embedded as ONNX metadata, so sherpa-onnx configures it automatically. No PyTorch or icefall needed at inference.
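
For callers who drive the graph by hand instead of through sherpa-onnx, the chunk loop amounts to threading states from one call to the next. The sketch below is runtime-agnostic and makes several assumptions that the card does not confirm: that each `new_<name>` output feeds the input called `<name>`, that consecutive chunks overlap (a window of T frames advancing by decode_chunk_len), and that trailing frames which do not fill a window are handled elsewhere. `run_chunk` is a hypothetical callable that would wrap the actual ONNX session.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Feed = Dict[str, object]


def chunk_windows(num_frames: int, window: int, shift: int) -> Iterator[range]:
    """Yield frame-index windows of length `window`, advancing by `shift`."""
    start = 0
    while start + window <= num_frames:
        yield range(start, start + window)
        start += shift


def next_states(outputs: Feed, prefix: str = "new_") -> Feed:
    """Turn updated-state outputs (new_<name>) into next-chunk inputs (<name>)."""
    return {k[len(prefix):]: v for k, v in outputs.items() if k.startswith(prefix)}


def stream_decode(
    frames: Sequence[object],
    window: int,
    shift: int,
    init_states: Feed,
    run_chunk: Callable[[Feed], Feed],
) -> Tuple[List[object], Feed]:
    """Run the graph chunk by chunk, carrying the cache states forward."""
    states = dict(init_states)
    log_probs: List[object] = []
    for idx in chunk_windows(len(frames), window, shift):
        outputs = run_chunk({"x": [frames[i] for i in idx], **states})
        log_probs.append(outputs["log_probs"])
        states = next_states(outputs)
    return log_probs, states


def fake_run(feed: Feed) -> Feed:
    """Stand-in for the ONNX graph: sums the chunk and counts processed frames."""
    seen = feed["processed_lens"] + len(feed["x"])
    return {"log_probs": sum(feed["x"]), "new_processed_lens": seen}


log_probs, final_states = stream_decode(
    list(range(10)), window=4, shift=3,
    init_states={"processed_lens": 0}, run_chunk=fake_run,
)
print(log_probs, final_states)
```

With a real session, `init_states` would hold zero-initialised tensors shaped according to the embedded metadata, and `frames` would be fbank rows.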

🗂 Training data

Fine-tuned from Muno459/zipformer_p-arabic (broad-Arabic phoneme encoder, ~1700 h) on Qur'an recitation:

Muno459/quran-phonemes: TarteelAI everyayah + tlog, with phoneme labels.
muaalem (obadx/muaalem-annotated-v3): gold phoneme/ṣifāt annotations (labels referenced, not redistributed).

Qur'an phoneme targets are deterministic (quran-transcript / quran_phonetizer, Ḥafṣ, madd 4/4/4/4), not LLM-generated, which is why this model is sharp.

⚠️ Limitations

Phoneme output, not text (pair with retrieval or a decoder). Tuned for Ḥafṣ; other qirāʾāt out of scope. Board WER depends on the retrieval decoder; PER is the cleaner accuracy signal.

📜 License & usage agreement (gated)

Shared freely for the sake of Allah, under a free / non-commercial term, not Apache-2.0. Access is gated: you must agree to use this model and its outputs only in apps that are free to end users. No selling, no paid subscriptions or paywalls, no ad revenue, no monetization of any kind, and you pass these terms on. See LICENSE.

وَما أَسأَلُكُم عَلَيهِ مِن أَجرٍ إِن أَجرِيَ إِلّا عَلىٰ رَبِّ العالَمِين
"I ask of you no payment for it; my reward is only with the Lord of the worlds." (Qur'an 26:109)

🔬 Details (model · training · evaluation)

For reviewers: exactly how this model is built, trained, and measured.

Model

Encoder: icefall/k2 Zipformer2, 6 stacks, downsampling_factor=(1,2,4,8,4,2), encoder_dim=(192,256,384,512,384,256), num_heads=(4,4,4,8,4,4), cnn_module_kernel=(31,31,15,15,15,31), output downsampling 2. Causal (streaming), chunk_size=24, left_context_frames=256 for the 1000 ms profile. ~65.5M params.
Front end: 80-bin kaldi fbank (povey window, 25 ms / 10 ms, 16 kHz), computed the same in train and eval.
Head: single CTC head (ScaledLinear, 250 phoneme units + blank). No RNN-T, no language model (deliberate: an LM would auto-correct reciter mistakes and hide tajwīd errors).
Tokenizer: 250 phoneme units (consonant+ḥaraka units), greedy longest-match segmentation; reproduces the muaalem ṣifāt segmentation with zero unknowns.
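
Greedy longest-match segmentation, as named in the tokenizer line above, can be sketched in a few lines. This is a generic illustration, not the repository's `PhonemeUnitTokenizer`: the unit inventory below is a hypothetical Latin toy set, and the on-disk format of `phoneme_units.json` is not assumed.

```python
from typing import Iterable, List


def longest_match_segment(text: str, units: Iterable[str]) -> List[str]:
    """Split `text` into units, always taking the longest unit that matches next."""
    vocab = set(units)
    max_len = max(len(u) for u in vocab)
    out: List[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until one is in the vocabulary.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                out.append(piece)
                pos += size
                break
        else:
            raise ValueError(f"no unit matches at position {pos}: {text[pos:pos + max_len]!r}")
    return out


# Hypothetical consonant+vowel units; long vowels get their own, longer units.
units = ["m", "a", "aa", "ma", "maa", "l", "i", "li", "k", "ki"]
print(longest_match_segment("maaliki", units))  # ['maa', 'li', 'ki']
print(longest_match_segment("maliki", units))   # ['ma', 'li', 'ki']
```

Raising on an unmatched position is one way to make "zero unknowns" checkable: a full pass over the label set without an exception means every label is covered by the inventory.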

Training

Broad-Arabic pre-training → Muno459/zipformer_p-arabic: ~1700 h (MASC, SADA, MGB2, CommonVoice, ArVoice, FLEURS), phoneme-CTC, 5 epochs, ScaledAdam + Eden LR. The broad model's CTC loss floors at ≈ 1.71, which reflects open-domain difficulty.
Qur'an fine-tune (this model): --init-from the broad encoder (weights only, fresh optimizer), CTC-only, lr 0.02, batch 16, max-dur 15 s, 3 epochs. Converged to train ctc ≈ 0.065.

Labels are deterministic, not LLM: Qur'an phonemes come from quran-transcript / quran_phonetizer (Ḥafṣ, madd 4/4/4/4), plus muaalem gold annotations. Sources: everyayah + tlog + muaalem. This determinism is why the Qur'an model is sharp.

Evaluation

Benchmark: the official, leakage-free Quran-Lab/quranic-asr-benchmark: 600 held-out clips, 200 each from everyayah (clean), QUL/Al-Nufais (unseen reciter), tlog (real phone-mic). Every clip verified not in training.
PER (primary signal): predicted phoneme units vs deterministic Ḥafṣ gold, unit-level edit distance (scripts/quran_per_eval.py). Overall 11.63% (everyayah 7.69 / QUL 14.97 / tlog 13.49). Madd-insensitive PER ≈ identical (9.59 vs 9.64), confirming the errors are real phonemes, not vowel-length artifacts.
Look-ahead sweep: PER 12.06% @ 320 ms (chunk 8) → 11.74% @ 640 ms (chunk 16) → 11.63% @ 1000 ms (chunk 24); offline (chunk 128) gives 11.66% and does not improve, so the model is genuinely streaming-robust.
WER (board metric): the model emits phonemes, so for word-WER each prediction is mapped to the nearest canonical āyah (rapidfuzz over the 9.1k-āyah text→phoneme table) and scored with the official score.py (scripts/quran_wer_retrieval.py). Overall 5.83 WER / 3.55 CER / 4.59 alef-insensitive. This is a Qur'an-lexicon-constrained decode (labeled (retrieval) on the board); it is not free text decoding like the other entries.
Honest reading: treat PER as the model's true accuracy. The board WER is "phonemes + a simple lexicon"; most of the gap to the 4.13 offline model is the naive retrieval picking a near-identical wrong short āyah, not the acoustics. A smarter phoneme→text decoder is the obvious next lever.
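
The PER definition above (unit-level edit distance against the gold sequence) can be sketched as below. `scripts/quran_per_eval.py` remains the reference; this version assumes a pooled aggregate (total edits over total reference units), and the unit strings are illustrative.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over units (substitution, insertion, deletion all cost 1)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def pooled_per(pairs: Sequence[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """PER in percent: total edits divided by total reference units."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    ref_units = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / ref_units


pairs = [
    (["maa", "li", "ki"], ["ma", "li", "ki"]),  # 1 substitution over 3 units
    (["ya", "w", "mi", "d", "dii", "n"], ["ya", "w", "mi", "d", "dii", "n"]),  # exact
]
per = pooled_per(pairs)
print(f"{per:.2f}%")  # 11.11%
```

The reported overall 11.63% is not the plain mean of the three per-source figures (that would be 12.05%), which is what a length-weighted aggregate of this kind would produce; that reading is an inference, not something the card states.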

Model tree for Muno459/zipformer_p-quran: base model Muno459/zipformer_p-arabic.