Usage agreement: free, non-commercial use only
This model is shared freely for the sake of Allah. By requesting access you agree: (1) you may use, fine-tune and redistribute it and its outputs ONLY in applications that are FREE to end users; (2) you may NOT sell it, place it behind a paid subscription or paywall, monetize it with ads, or earn any revenue from an app or service that uses this model or its outputs; (3) these terms pass on to anyone you share it with. وَما أَسأَلُكُم عَلَيهِ مِن أَجرٍ إِن أَجرِيَ إِلّا عَلىٰ رَبِّ العالَمِين
- 🕌 Quran Phoneme Zipformer · streaming · tajwīd-aware
- 🏆 Where it lands (official Quran-Lab board, 600 held-out clips, same scorer for everyone)
- 🎯 The point: it catches mistakes
- 📊 Phoneme accuracy (PER, vs deterministic Ḥafṣ gold)
- 🧠 Why retrieval (not free text decoding)?
- 📦 Files
- 🚀 Usage
- 🗂 Training data
- ⚠️ Limitations
- 📜 License & usage agreement (gated)
- 🔬 Details (model · training · evaluation)
🕌 Quran Phoneme Zipformer · streaming · tajwīd-aware
A 65.5M-parameter streaming recognizer that outputs the phonemes the reciter actually said, so it can catch recitation mistakes that a conventional ASR system silently corrects.
🥇 #1 streaming model on the leakage-free Quranic ASR Leaderboard · 5.83 WER, roughly half the WER of the next streaming system (10.99), and closing in on the best offline model (4.13).
🏆 Where it lands (official Quran-Lab board, 600 held-out clips, same scorer for everyone)
| # | model | WER ⬇️ | mode |
|---|---|---|---|
| 1 | fastconformer-quran | 4.13 | offline |
| 2 | ➡️ this model | 5.83 | 🟢 streaming |
| 3 | mohammed fastconformer | 6.95 | offline |
| 4 | nvidia fastconformer | 8.14 | offline |
| 5 | Tarteel (official) | 10.99 | streaming |
| 6 | whisper-large-v3 | 11.65 | offline |
| 7 | fastconformer-quran | 11.96 | streaming |
🔗 Live board: Quranic ASR Leaderboard · 📦 dataset + scorer: Quran-Lab/quranic-asr-benchmark
This is the best streaming model on the board. The remaining gap to the offline #1 comes mostly from the simple retrieval decoder (see Why retrieval), not from the acoustics.
🎯 The point: it catches mistakes
A normal Qur'an ASR outputs text and leans on a language model that "knows" the Qur'an, so when a reciter slips it auto-corrects the output back to the expected āyah and the mistake vanishes. Great for transcription, useless for teaching.
This model has no LM and emits phonemes, so the output is what was really said. Compare it to the
canonical phonemes of the intended āyah (quran_text2phoneme.json, deterministic Ḥafṣ) and the deviations pop out:
Intended (canonical): مَالِكِ يَومِ ددِين maaliki yawmi d-dīn
Reciter said: مَلِكِ يَومِ ددِين maliki yawmi d-dīn (madd dropped)
Model output: مَلِكِ ... (matches what was said)
Flag: مَالِكِ vs مَلِكِ ⚠ madd omitted on "مالك"
What surfaces: ✗ ḥarakāt errors · ✗ madd length · ✗ wrong/missing/extra letters · ✗ tajwīd rules
(idghām, ikhfāʾ, qalqala) not applied. Correction-app flow: audio → model → phonemes → align vs canonical → flag.
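The "align vs canonical → flag" step can be sketched with Python's standard `difflib`. This is a minimal illustration, not the author's alignment code: the function name `flag_deviations` is hypothetical, the romanized units are toy stand-ins for the real 250-unit inventory, and `difflib` produces a reasonable alignment rather than a guaranteed minimum-edit one.

```python
from difflib import SequenceMatcher


def flag_deviations(canonical, predicted):
    """Return (op, expected_units, heard_units) for every span that differs."""
    matcher = SequenceMatcher(a=canonical, b=predicted, autojunk=False)
    flags = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # op is "replace", "delete" or "insert"
            flags.append((op, canonical[i1:i2], predicted[j1:j2]))
    return flags


# Toy romanized units (assumption), not the model's real phoneme units.
canonical = ["m", "aa", "l", "i", "k", "i"]  # intended: long vowel (madd)
predicted = ["m", "a", "l", "i", "k", "i"]   # heard: madd dropped

print(flag_deviations(canonical, predicted))
```

Each flagged span carries both the expected and the heard units, which is what a correction app would need in order to show the learner where the recitation deviated.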
📊 Phoneme accuracy (PER, vs deterministic Ḥafṣ gold)
| source | PER |
|---|---|
| everyayah (clean) | 7.69% |
| QUL (unseen reciter) | 14.97% |
| tlog (real phone-mic) | 13.49% |
| overall | 11.63% |
🎚️ Look-ahead barely matters (genuinely streaming-robust), overall PER on the same 600 clips:
| look-ahead | chunk | PER |
|---|---|---|
| 320 ms | 8 | 12.06% |
| 640 ms | 16 | 11.74% |
| 1000 ms | 24 | 11.63% |
| offline | 128 | 11.66% |
Even aggressive 320 ms streaming is within ~0.4 pts of full context, and going past the trained 1000 ms profile does not help. Treat PER as the true accuracy signal.
🧠 Why retrieval (not free text decoding)?
The model emits phonemes, not words, so to score word-WER at all the phonemes must become text, and there
is no free text decoder here. Since the Qur'an is a closed, fixed text, the faithful mapping is to match the
predicted phoneme string to the nearest canonical āyah and read off its text. So retrieval is not a trick to
inflate the score, it is the intrinsic decode step for a phoneme model on a closed corpus (labeled
(retrieval) on the board for honesty). Most of the gap to 4.13 is the naive retrieval picking a similar but
wrong short āyah, so the acoustics are better than the WER alone suggests, and a smarter decoder is the lever to go lower.
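A minimal sketch of nearest-āyah retrieval follows. It uses the standard-library `difflib` as a stand-in for the fuzzy matcher actually used for the board, so scores will differ from the real pipeline. The function name `nearest_ayah`, the table layout, and the romanized phoneme strings are all illustrative assumptions; the real quran_text2phoneme.json format is not reproduced here.

```python
from difflib import SequenceMatcher


def nearest_ayah(predicted, table):
    """Return (key, similarity) of the canonical entry closest to `predicted`."""
    def similarity(candidate):
        return SequenceMatcher(None, predicted, candidate, autojunk=False).ratio()

    best_key = max(table, key=lambda k: similarity(table[k]))
    return best_key, similarity(table[best_key])


# Hypothetical toy table: keys and phoneme strings are for illustration only.
toy_table = {
    "1:4": "maaliki yawmi ddiin",
    "1:5": "iyyaaka nacbudu wa iyyaaka nastaciin",
    "112:1": "qul huwa llaahu ahad",
}

key, score = nearest_ayah("maliki yawmi ddiin", toy_table)
print(key, round(score, 3))
```

The failure mode described above is visible in this design: when two short canonical entries have almost identical phoneme strings, a small acoustic error can tip the maximum to the wrong one.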
📦 Files
| file | what |
|---|---|
| quran_phoneme_zipformer.pt | PyTorch weights (65.5M, inference-only) + blank_id |
| quran_phoneme_zipformer.onnx | cache-aware streaming CTC ONNX (sherpa-onnx compatible) |
| quran_phoneme_zipformer.int8.onnx | INT8 streaming ONNX (73 MB, dynamic-quantized MatMul) |
| scripts/export_quran_streaming_onnx.py | the streaming ONNX + int8 export script |
| phoneme_units.json | 250-unit phoneme tokenizer |
| quran_text2phoneme.json | canonical text→phoneme table (mistake-detection / retrieval) |
| scripts/zipformer_rnnt_ctc_train.py | build_model + PhonemeUnitTokenizer (needed to load/export) |
| scripts/zipformer_rnnt_ctc_eval.py | greedy_ctc_decode |
| quran_per_eval.py, quran_wer_retrieval.py | reproduce PER / board WER |
🚀 Usage
The model is an icefall Zipformer2. Build it with build_model from the included scripts/zipformer_rnnt_ctc_train.py, load the
["model"] weights, feed 80-bin kaldi fbank (povey window) at 16 kHz, and decode CTC greedily. For the 1000 ms profile set
chunk_size=(24,), left_context_frames=(256,). You also need k2-fsa/icefall cloned (for zipformer/scaling/subsampling).
```python
import torch

from scripts.zipformer_rnnt_ctc_train import build_model, load_tokenizer

# Load the phoneme tokenizer; the blank id is the index right after the last unit.
tok = load_tokenizer("phoneme_units.json")
blank = tok.get_piece_size()

# Build the streaming model and load the inference weights on CPU.
model = build_model(blank + 1, blank, chunk_frames=[24], left_context_frames=256).eval()
model.load_state_dict(torch.load("quran_phoneme_zipformer.pt", map_location="cpu")["model"])

# enc, lens = model.encode(feats, feat_lens); ids = model.ctc_head(enc).argmax(-1)
```
quran_per_eval.py is a complete runnable example.
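For readers unfamiliar with greedy CTC decoding, the rule applied to the per-frame argmax ids is: merge consecutive repeats, then drop blanks. The sketch below shows only that rule; `ctc_greedy_collapse` is a hypothetical name (the repository's own function is greedy_ctc_decode, whose signature is not shown here), and the ids are made up, with 250 used as the blank id purely for illustration.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeated ids, then remove the blank id."""
    return [unit for unit, _ in groupby(frame_ids) if unit != blank_id]


# Made-up per-frame argmax ids; 250 plays the role of blank here.
frame_ids = [250, 7, 7, 250, 7, 12, 12, 250, 250]
print(ctc_greedy_collapse(frame_ids, blank_id=250))  # [7, 7, 12]
```

Note that the blank between the two runs of 7 is what lets the same unit appear twice in a row in the output.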
ONNX (cache-aware streaming, sherpa-onnx ready): the ONNX is a chunk-by-chunk streaming zipformer2-CTC graph. Inputs are one chunk x (1,T,80) of 80-bin kaldi fbank plus the encoder cache-state tensors (cached_key/nonlin_attn/val1/val2/conv1/conv2 per layer, embed_states, processed_lens); outputs are log_probs plus the updated new_* states you feed into the next chunk. All the streaming params (decode_chunk_len, T, left_context_len, layer dims, etc.) are embedded as ONNX metadata, so sherpa-onnx configures it automatically. No PyTorch or icefall needed at inference.
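The state-passing pattern described above (each chunk's new_* outputs become the next chunk's cache inputs) can be sketched without any ONNX dependency. Everything here is illustrative: `run_streaming` and `fake_step` are hypothetical names, and the fake step stands in for one forward pass of the real graph, tracking only a processed-frame counter instead of the actual cache tensors.

```python
def run_streaming(step, chunks, init_states):
    """Feed chunks in order, passing each call's new states into the next call."""
    states = init_states
    outputs = []
    for chunk in chunks:
        log_probs, states = step(chunk, states)
        outputs.append(log_probs)
    return outputs, states


def fake_step(chunk, states):
    """Stand-in for one model call: the 'state' is just the frames seen so far."""
    processed = states["processed_lens"] + len(chunk)
    return [processed] * len(chunk), {"processed_lens": processed}


outs, final = run_streaming(fake_step, [[0.1, 0.2], [0.3]], {"processed_lens": 0})
print(outs, final)
```

With sherpa-onnx this loop is handled internally, so the sketch matters mainly if you drive the graph yourself.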
🗂 Training data
Fine-tuned from Muno459/zipformer_p-arabic (broad-Arabic phoneme encoder, ~1700 h) on Qur'an recitation:
- Muno459/quran-phonemes: TarteelAI everyayah + tlog, with phoneme labels.
- muaalem (obadx/muaalem-annotated-v3): gold phoneme/ṣifāt annotations (labels referenced, not redistributed).
Qur'an phoneme targets are deterministic (quran-transcript / quran_phonetizer, Ḥafṣ, madd 4/4/4/4), not
LLM-generated, which is why this model is sharp.
⚠️ Limitations
Phoneme output, not text (pair with retrieval or a decoder). Tuned for Ḥafṣ; other qirāʾāt out of scope. Board WER depends on the retrieval decoder; PER is the cleaner accuracy signal.
📜 License & usage agreement (gated)
Shared freely for the sake of Allah, under a free / non-commercial term, not Apache-2.0. Access is
gated: you must agree to use this model and its outputs only in apps that are free to end users. No selling,
no paid subscriptions or paywalls, no ad revenue, no monetization of any kind, and you pass these terms on. See LICENSE.
وَما أَسأَلُكُم عَلَيهِ مِن أَجرٍ إِن أَجرِيَ إِلّا عَلىٰ رَبِّ العالَمِين
("I ask of you no payment for it; my reward is only from the Lord of the worlds.")
🔬 Details (model · training · evaluation)
For reviewers: exactly how this model is built, trained, and measured.
Model
- Encoder: icefall/k2 Zipformer2, 6 stacks, downsampling_factor=(1,2,4,8,4,2), encoder_dim=(192,256,384,512,384,256), num_heads=(4,4,4,8,4,4), cnn_module_kernel=(31,31,15,15,15,31), output downsampling 2. Causal (streaming), chunk_size=24, left_context_frames=256 for the 1000 ms profile. ~65.5M params.
- Front end: 80-bin kaldi fbank (povey window, 25 ms / 10 ms, 16 kHz), computed the same in train and eval.
- Head: single CTC head (ScaledLinear, 250 phoneme units + blank). No RNN-T, no language model (deliberate: an LM would auto-correct reciter mistakes and hide tajwīd errors).
- Tokenizer: 250 phoneme units (consonant+ḥaraka units), greedy longest-match segmentation; reproduces the muaalem ṣifāt segmentation with zero unknowns.
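Greedy longest-match segmentation, as named in the tokenizer bullet, can be sketched as follows. This is not the repository's PhonemeUnitTokenizer: `longest_match_segment` is a hypothetical name and the romanized unit set is a toy assumption rather than the real 250 units.

```python
def longest_match_segment(text, units):
    """Split `text` by taking the longest known unit at each position."""
    max_len = max(len(u) for u in units)
    pieces, pos = [], 0
    while pos < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in units:
                pieces.append(candidate)
                pos += size
                break
        else:
            raise ValueError(f"no unit matches at position {pos}: {text[pos:]!r}")
    return pieces


# Toy romanized units (assumption): consonant+vowel units plus bare consonants.
toy_units = {"m", "ma", "maa", "l", "li", "k", "ki"}
print(longest_match_segment("maaliki", toy_units))  # ['maa', 'li', 'ki']
```

Raising on an unmatched position mirrors the "zero unknowns" property: if the unit inventory covers the label set, the error branch is never reached.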
Training
- Broad-Arabic pre-training → Muno459/zipformer_p-arabic: ~1700 h (MASC, SADA, MGB2, CommonVoice, ArVoice, FLEURS), phoneme-CTC, 5 epochs, ScaledAdam + Eden LR. Broad floor ctc ≈ 1.71 (open-domain difficulty).
- Qur'an fine-tune (this model): --init-from the broad encoder (weights only, fresh optimizer), CTC-only, lr 0.02, batch 16, max-dur 15 s, 3 epochs. Converged to train ctc ≈ 0.065.
- Labels are deterministic, not LLM: Qur'an phonemes come from quran-transcript / quran_phonetizer (Ḥafṣ, madd 4/4/4/4), plus muaalem gold annotations. Sources: everyayah + tlog + muaalem. This determinism is why the Qur'an model is sharp.
Evaluation
- Benchmark: the official, leakage-free Quran-Lab/quranic-asr-benchmark: 600 held-out clips, 200 each from everyayah (clean), QUL/Al-Nufais (unseen reciter), tlog (real phone-mic). Every clip verified not in training.
- PER (primary signal): predicted phoneme units vs deterministic Ḥafṣ gold, unit-level edit distance (scripts/quran_per_eval.py). Overall 11.63% (everyayah 7.69 / QUL 14.97 / tlog 13.49). Madd-insensitive PER ≈ identical (9.59 vs 9.64), confirming the errors are real phonemes, not vowel-length artifacts.
- Look-ahead sweep: PER 10.1% @ 320 ms (chunk 8) → 9.8% @ 640 ms → 9.6% @ 1000 ms (chunk 24); offline (chunk 128) does not improve, so it is genuinely streaming-robust.
- WER (board metric): the model emits phonemes, so for word-WER each prediction is mapped to the nearest canonical āyah (rapidfuzz over the 9.1k-āyah text→phoneme table) and scored with the official score.py (scripts/quran_wer_retrieval.py). Overall 5.83 WER / 3.55 CER / 4.59 alef-insensitive. This is a Qur'an-lexicon-constrained decode (labeled (retrieval) on the board); it is not free text decoding like the other entries.
- Honest reading: treat PER as the model's true accuracy. The board WER is "phonemes + a simple lexicon"; most of the gap to the 4.13 offline model is the naive retrieval picking a near-identical wrong short āyah, not the acoustics. A smarter phoneme→text decoder is the obvious next lever.
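A unit-level PER can be computed as Levenshtein distance over phoneme units divided by the number of reference units. The sketch below is not scripts/quran_per_eval.py: the function names are hypothetical, the units are toy romanized stand-ins, and pooling edits over the whole corpus (rather than averaging per-clip rates) is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two unit sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def per(refs, hyps):
    """Corpus-level PER in percent: total edits / total reference units."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * edits / total


# Toy example: one substituted unit out of ten reference units.
refs = [["m", "aa", "l", "i", "k", "i"], ["q", "u", "l", "hu"]]
hyps = [["m", "a", "l", "i", "k", "i"], ["q", "u", "l", "hu"]]
print(per(refs, hyps))
```

A madd-insensitive variant would map long and short vowel units to the same symbol before calling `edit_distance`; the exact mapping used for the reported numbers is not specified here.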
Model tree for Muno459/zipformer_p-quran
Base model
Muno459/zipformer_p-arabic