Usage agreement: free, non-commercial use only


This model is shared freely for the sake of Allah. By requesting access you agree: (1) you may use, fine-tune and redistribute it and its outputs ONLY in applications that are FREE to end users; (2) you may NOT sell it, place it behind a paid subscription or paywall, monetize it with ads, or earn any revenue from an app or service that uses this model or its outputs; (3) these terms pass on to anyone you share it with. وَما أَسأَلُكُم عَلَيهِ مِن أَجرٍ إِن أَجرِيَ إِلّا عَلىٰ رَبِّ العالَمِين


🕌 Quran Phoneme Zipformer · streaming · tajwīd-aware

A 65M-parameter streaming recognizer that hears what the reciter actually said (phonemes), so it can catch recitation mistakes that a normal ASR quietly hides.

🥇 #1 streaming model on the leakage-free Quranic ASR Leaderboard · 5.83 WER, roughly half the WER of the next streaming system (10.99), and closing in on the best offline model (4.13).


🏆 Where it lands (official Quran-Lab board, 600 held-out clips, same scorer for everyone)

| # | model | WER ⬇️ | mode |
|---|---|---|---|
| 1 | fastconformer-quran | 4.13 | offline |
| 2 | ➡️ this model | 5.83 | 🟢 streaming |
| 3 | mohammed fastconformer | 6.95 | offline |
| 4 | nvidia fastconformer | 8.14 | offline |
| 5 | Tarteel (official) | 10.99 | streaming |
| 6 | whisper-large-v3 | 11.65 | offline |
| 7 | fastconformer-quran | 11.96 | streaming |

🔗 Live board: Quranic ASR Leaderboard · 📦 dataset + scorer: Quran-Lab/quranic-asr-benchmark

Best streaming model on the board, and the gap to the offline #1 is mostly the simple decoder (see Why retrieval), not the acoustics.


🎯 The point: it catches mistakes

A normal Qur'an ASR outputs text and leans on a language model that "knows" the Qur'an, so when a reciter slips it auto-corrects the output back to the expected āyah and the mistake vanishes. Great for transcription, useless for teaching.

This model has no LM and emits phonemes, so the output is what was really said. Compare it to the canonical phonemes of the intended āyah (quran_text2phoneme.json, deterministic Ḥafṣ) and the deviations pop out:

Intended (canonical):   مَالِكِ يَومِ ددِين      maaliki yawmi d-dīn
Reciter said:           مَلِكِ يَومِ ددِين       maliki  yawmi d-dīn   (madd dropped)
Model output:           مَلِكِ ...                                      (matches what was said)
Flag:                   مَالِكِ vs مَلِكِ          ⚠ madd omitted on "مالك"

What surfaces: ✗ ḥarakāt errors · ✗ madd length · ✗ wrong/missing/extra letters · ✗ tajwīd rules (idghām, ikhfāʾ, qalqala) not applied. Correction-app flow: audio → model → phonemes → align vs canonical → flag.
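
The "align vs canonical → flag" step can be sketched with the standard library. This is a minimal illustration, not the repository's own code: `flag_deviations` is a hypothetical helper, and the romanized units below are toy stand-ins for the real 250-unit inventory in phoneme_units.json.

```python
from difflib import SequenceMatcher


def flag_deviations(canonical, predicted):
    """Return (op, canonical_units, predicted_units) for every span that differs.

    Both arguments are lists of phoneme units. `op` is one of
    "replace", "delete", or "insert" (difflib's opcode names).
    """
    matcher = SequenceMatcher(a=canonical, b=predicted, autojunk=False)
    flags = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            flags.append((op, canonical[i1:i2], predicted[j1:j2]))
    return flags


# Toy units (assumption): "aa" is a long vowel, "a" a short one.
canonical = ["m", "aa", "l", "i", "k", "i"]
predicted = ["m", "a", "l", "i", "k", "i"]  # madd dropped by the reciter
flags = flag_deviations(canonical, predicted)
print(flags)
```

Each flagged span carries both sides of the mismatch, so a correction app could map it back to the word it falls in and to the rule that was broken.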


📊 Phoneme accuracy (PER, vs deterministic Ḥafṣ gold)

| source | PER |
|---|---|
| everyayah (clean) | 7.69% |
| QUL (unseen reciter) | 14.97% |
| tlog (real phone-mic) | 13.49% |
| overall | 11.63% |

🎚️ Look-ahead barely matters (genuinely streaming-robust), overall PER on the same 600 clips:

| look-ahead | chunk | PER |
|---|---|---|
| 320 ms | 8 | 12.06% |
| 640 ms | 16 | 11.74% |
| 1000 ms | 24 | 11.63% |
| offline | 128 | 11.66% |

Even aggressive 320 ms streaming is within ~0.4 pts of full context, and going past the trained 1000 ms profile does not help. Treat PER as the true accuracy signal.


🧠 Why retrieval (not free text decoding)?

The model emits phonemes, not words, so to score word-WER at all the phonemes must become text, and there is no free text decoder here. Since the Qur'an is a closed, fixed text, the faithful mapping is to match the predicted phoneme string to the nearest canonical āyah and read off its text. So retrieval is not a trick to inflate the score, it is the intrinsic decode step for a phoneme model on a closed corpus (labeled (retrieval) on the board for honesty). Most of the gap to 4.13 is the naive retrieval picking a similar but wrong short āyah, so the acoustics are better than the WER alone suggests, and a smarter decoder is the lever to go lower.
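
The nearest-āyah lookup described above might look like the sketch below. It is an illustration under assumptions: the real scorer uses rapidfuzz over the full table, while this version uses `difflib` from the standard library as a stand-in, and the table layout (āyah key → canonical phoneme string) and the romanized strings are hypothetical.

```python
from difflib import SequenceMatcher


def retrieve_ayah(predicted, canonical_table):
    """Return (key, similarity) of the canonical entry closest to `predicted`.

    canonical_table: dict mapping an ayah key to its canonical phoneme
    string (assumed layout, for illustration only).
    """
    def similarity(key):
        return SequenceMatcher(
            None, predicted, canonical_table[key], autojunk=False
        ).ratio()

    best = max(canonical_table, key=similarity)
    return best, similarity(best)


# Toy table with romanized stand-ins for the real phoneme strings.
table = {
    "1:2": "alhamdu lillaahi rabbi l3aalamiin",
    "1:4": "maaliki yawmi ddiin",
}
key, score = retrieve_ayah("maliki yawmi ddiin", table)
print(key, round(score, 3))
```

Because the lookup always returns some āyah, a short prediction can land on a similar but wrong short āyah, which is the failure mode the paragraph above attributes most of the WER gap to.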


📦 Files

| file | what |
|---|---|
| quran_phoneme_zipformer.pt | PyTorch weights (65.5M, inference-only) + blank_id |
| quran_phoneme_zipformer.onnx | cache-aware streaming CTC ONNX (sherpa-onnx compatible) |
| quran_phoneme_zipformer.int8.onnx | INT8 streaming ONNX (73 MB, dynamic-quantized MatMul) |
| scripts/export_quran_streaming_onnx.py | the streaming ONNX + int8 export script |
| phoneme_units.json | 250-unit phoneme tokenizer |
| quran_text2phoneme.json | canonical text→phoneme table (mistake-detection / retrieval) |
| scripts/zipformer_rnnt_ctc_train.py | build_model + PhonemeUnitTokenizer (needed to load/export) |
| scripts/zipformer_rnnt_ctc_eval.py | greedy_ctc_decode |
| quran_per_eval.py, quran_wer_retrieval.py | reproduce PER / board WER |

🚀 Usage

The model is an icefall Zipformer2. Build it with build_model from the included scripts/zipformer_rnnt_ctc_train.py, load the ["model"] weights, feed 80-bin kaldi fbank (povey window) at 16 kHz, and decode CTC greedily. For the 1000 ms profile set chunk_size=(24,), left_context_frames=(256,). You also need a clone of k2-fsa/icefall (for its zipformer, scaling, and subsampling modules).

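import torch
from scripts.zipformer_rnnt_ctc_train import build_model, load_tokenizer
# Load the phoneme tokenizer; the blank id is the index right after the last unit.
tok = load_tokenizer("phoneme_units.json")
blank = tok.get_piece_size()
# Build the streaming model for the 1000 ms profile (chunk 24, left context 256).
model = build_model(blank + 1, blank, chunk_frames=[24], left_context_frames=256).eval()
model.load_state_dict(torch.load("quran_phoneme_zipformer.pt", map_location="cpu")["model"])
# Encode fbank features, then take the per-frame argmax of the CTC head:
# enc, lens = model.encode(feats, feat_lens)
# ids = model.ctc_head(enc).argmax(-1)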

quran_per_eval.py is a complete runnable example.
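
Greedy CTC decoding turns the per-frame argmax ids into a unit sequence by merging consecutive repeats and then dropping blanks. The sketch below shows that rule in plain Python. `greedy_ctc_collapse` is a hypothetical stand-in for the repository's greedy_ctc_decode, whose exact signature is not shown here, and the blank id of 250 is an assumption based on the 250-unit tokenizer.

```python
def greedy_ctc_collapse(frame_ids, blank_id):
    """Collapse per-frame argmax ids into a CTC output sequence.

    A frame is emitted only if it differs from the previous frame and is
    not the blank, so a blank between two equal ids keeps both of them.
    """
    units = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            units.append(idx)
        previous = idx
    return units


BLANK = 250  # assumption: blank sits right after the 250 phoneme units
frames = [5, 5, BLANK, 5, 7, 7, BLANK, BLANK, 3]
decoded = greedy_ctc_collapse(frames, BLANK)
print(decoded)
```

The resulting ids would then be mapped back to phoneme units with the tokenizer before any comparison against the canonical sequence.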

ONNX (cache-aware streaming, sherpa-onnx ready): the ONNX is a chunk-by-chunk streaming zipformer2-CTC graph. Inputs are one chunk x (1,T,80) of 80-bin kaldi fbank plus the encoder cache-state tensors (cached_key/nonlin_attn/val1/val2/conv1/conv2 per layer, embed_states, processed_lens); outputs are log_probs plus the updated new_* states you feed into the next chunk. All the streaming params (decode_chunk_len, T, left_context_len, layer dims, etc.) are embedded as ONNX metadata, so sherpa-onnx configures it automatically. No PyTorch or icefall needed at inference.
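
The chunk-by-chunk state hand-off can be sketched independently of any runtime. Everything here is an assumption for illustration: `stream_chunks` and `run_chunk` are hypothetical names, the feature input is assumed to be called "x", and each updated state is assumed to be named "new_" plus the name of the input it replaces. Check the actual input and output names in the exported graph before relying on this.

```python
def stream_chunks(run_chunk, chunks, init_states, input_name="x"):
    """Feed chunks one at a time, carrying the encoder cache between calls.

    run_chunk:   callable taking a dict of named inputs and returning a
                 dict of named outputs (for example a thin wrapper around
                 an ONNX runtime session).
    chunks:      iterable of feature chunks, each shaped (1, T, 80).
    init_states: dict of initial cache tensors keyed by input name.
    """
    states = dict(init_states)
    log_probs = []
    for chunk in chunks:
        outputs = run_chunk({input_name: chunk, **states})
        log_probs.append(outputs["log_probs"])
        # Assumed naming: output "new_<name>" becomes input "<name>" next time.
        states = {
            name[len("new_"):]: value
            for name, value in outputs.items()
            if name.startswith("new_")
        }
    return log_probs, states
```

With onnxruntime, `run_chunk` could wrap `InferenceSession.run` by zipping the session's output names with the returned arrays. The initial states are left to the caller, for example zero tensors shaped from the ONNX metadata.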

🗂 Training data

Fine-tuned from Muno459/zipformer_p-arabic (broad-Arabic phoneme encoder, ~1700 h) on Qur'an recitation from everyayah, tlog, and muaalem.

Qur'an phoneme targets are deterministic (quran-transcript / quran_phonetizer, Ḥafṣ, madd 4/4/4/4), not LLM-generated, which is why this model is sharp.

⚠️ Limitations

Phoneme output, not text (pair with retrieval or a decoder). Tuned for Ḥafṣ; other qirāʾāt out of scope. Board WER depends on the retrieval decoder; PER is the cleaner accuracy signal.

📜 License & usage agreement (gated)

Shared freely for the sake of Allah, under a free / non-commercial term, not Apache-2.0. Access is gated: you must agree to use this model and its outputs only in apps that are free to end users. No selling, no paid subscriptions or paywalls, no ad revenue, no monetization of any kind, and you pass these terms on. See LICENSE.

وَما أَسأَلُكُم عَلَيهِ مِن أَجرٍ إِن أَجرِيَ إِلّا عَلىٰ رَبِّ العالَمِين


🔬 Details (model · training · evaluation)

For reviewers: exactly how this model is built, trained, and measured.

Model

  • Encoder: icefall/k2 Zipformer2, 6 stacks, downsampling_factor=(1,2,4,8,4,2), encoder_dim=(192,256,384,512,384,256), num_heads=(4,4,4,8,4,4), cnn_module_kernel=(31,31,15,15,15,31), output downsampling 2. Causal (streaming), chunk_size=24, left_context_frames=256 for the 1000 ms profile. ~65.5M params.
  • Front end: 80-bin kaldi fbank (povey window, 25 ms / 10 ms, 16 kHz), computed the same in train and eval.
  • Head: single CTC head (ScaledLinear, 250 phoneme units + blank). No RNN-T, no language model (deliberate: an LM would auto-correct reciter mistakes and hide tajwīd errors).
  • Tokenizer: 250 phoneme units (consonant+ḥaraka units), greedy longest-match segmentation; reproduces the muaalem ṣifāt segmentation with zero unknowns.
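
Greedy longest-match segmentation, as named in the tokenizer bullet, can be sketched as follows. This is a toy version under assumptions: `segment_longest_match` is a hypothetical helper, and the Latin-letter units stand in for the real consonant+ḥaraka units in phoneme_units.json.

```python
def segment_longest_match(text, units):
    """Split `text` into units, always taking the longest unit that fits.

    Raises ValueError if no unit matches at some position, which is the
    "unknown" case the real tokenizer is reported never to hit.
    """
    max_len = max(len(u) for u in units)
    pieces = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in units:
                pieces.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no unit matches at position {i}: {text[i]!r}")
    return pieces


# Toy inventory (assumption): consonant plus short or long vowel.
units = {"m", "ma", "maa", "l", "li", "k", "ki"}
print(segment_longest_match("maaliki", units))
print(segment_longest_match("maliki", units))
```

Because the long-vowel unit is tried before the short one, a dropped madd shows up as a different unit rather than as a missing character, which keeps the comparison at unit level.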

Training

  1. Broad-Arabic pre-training (Muno459/zipformer_p-arabic): ~1700 h (MASC, SADA, MGB2, CommonVoice, ArVoice, FLEURS), phoneme-CTC, 5 epochs, ScaledAdam + Eden LR. Broad floor ctc ≈ 1.71 (open-domain difficulty).
  2. Qur'an fine-tune (this model): --init-from the broad encoder (weights only, fresh optimizer), CTC-only, lr 0.02, batch 16, max-dur 15 s, 3 epochs. Converged to train ctc ≈ 0.065.
  • Labels are deterministic, not LLM: Qur'an phonemes come from quran-transcript / quran_phonetizer (Ḥafṣ, madd 4/4/4/4), plus muaalem gold annotations. Sources: everyayah + tlog + muaalem. This determinism is why the Qur'an model is sharp.

Evaluation

  • Benchmark: the official, leakage-free Quran-Lab/quranic-asr-benchmark: 600 held-out clips, 200 each from everyayah (clean), QUL/Al-Nufais (unseen reciter), tlog (real phone-mic). Every clip verified not in training.
  • PER (primary signal): predicted phoneme units vs deterministic Ḥafṣ gold, unit-level edit distance (scripts/quran_per_eval.py). Overall 11.63% (everyayah 7.69 / QUL 14.97 / tlog 13.49). Madd-insensitive PER ≈ identical (9.59 vs 9.64), confirming the errors are real phonemes, not vowel-length artifacts.
  • Look-ahead sweep: PER 12.06% @ 320 ms (chunk 8) → 11.74% @ 640 ms (chunk 16) → 11.63% @ 1000 ms (chunk 24); offline (chunk 128) gives 11.66%, no better, so it is genuinely streaming-robust.
  • WER (board metric): the model emits phonemes, so for word-WER each prediction is mapped to the nearest canonical āyah (rapidfuzz over the 9.1k-āyah text→phoneme table) and scored with the official score.py (scripts/quran_wer_retrieval.py). Overall 5.83 WER / 3.55 CER / 4.59 alef-insensitive. This is a Qur'an-lexicon-constrained decode (labeled (retrieval) on the board); it is not free text decoding like the other entries.
  • Honest reading: treat PER as the model's true accuracy. The board WER is "phonemes + a simple lexicon"; most of the gap to the 4.13 offline model is the naive retrieval picking a near-identical wrong short āyah, not the acoustics. A smarter phoneme→text decoder is the obvious next lever.
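
Unit-level PER, the primary signal in the list above, is edit distance divided by reference length. The sketch below pools errors over all clips. Whether the repository's script pools or averages per clip is not stated here, so treat the pooling as an assumption; the unit strings are toy stand-ins.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme units."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def phoneme_error_rate(pairs):
    """Pooled PER over (reference_units, predicted_units) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total if total else 0.0


pairs = [
    (["maa", "li", "ki"], ["ma", "li", "ki"]),  # one substitution
    (["yaw", "mi"], ["yaw", "mi"]),             # exact match
]
per = phoneme_error_rate(pairs)
print(f"{per:.2%}")
```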