Masha'Allah, that is awesome work
Alhamdulillah <3
Thank you <3
This is the first streaming model for the Great Quran
It also has high accuracy, Arabic output with tashkeel, encoder output, and q8 versions, masha'Allah
Don't forget to add the q8 version for the streaming encoder <3
Soon, insha'Allah, I will share the repo for the Flutter app with you
Alhamdulillah, this streaming model uses far less CPU than the simulated streaming I was running with the offline model
Sorry for all the requests and comments, my brother <3
Wa alaykum as-salam, jazak Allah khayr <3
Done, added model_with_encoder.q8.onnx (INT8 dynamic-quantized, ~132 MB vs 438 MB fp32). Same I/O as the fp32 encoder (inputs audio_signal, length -> outputs logprobs + encoder_output), so it is a drop-in for the encoder-feature / pronunciation-head path in the Flutter app, just lighter.
Quick notes for the app:
- model.q8.onnx = streaming ASR (logprobs + cache only). model_with_encoder.q8.onnx = same weights but also exposes encoder_output, which you need to feed the pronunciation head (head/pronunciation_head.pt).
- And yes, the cache-aware streaming model is much lighter than the simulated-streaming offline approach (no overlap re-decode each step), so CPU stays low, exactly as you saw.
No need to apologize for the requests at all, keep them coming <3
Thanks, my brother <3
When I finish the project I'll share it with you, insha'Allah
May Allah accept <3
Sorry for all the questions, my brother <3
I wanted to ask: is this the right input/output for the new streaming model with encoder, or am I using the wrong ONNX file again, haha <3
--- Inspecting: model.onnx ---
[INPUTS]
Name: 'audio_signal' | Shape: [B, 80, T_in]
Name: 'length' | Shape: [B]
[OUTPUTS]
Name: 'logprobs' | Shape: [B, T_out, 1025]
Name: 'encoder_output' | Shape: [B, 512, T_out]
If this is the one, then it has no cache tensors, so it cannot work as a cache-aware streaming model with the encoder output?
I thought I would use the same streaming model both for real-time word-by-word and for record-and-stop tajweed, since it can output both the encoder features and the logprobs
Wa alaykum as-salam wa rahmatullah, jazak Allah khayr <3
No need to apologize, ask as much as you like my brother <3
Good catch, you are not using it wrong, you just have the wrong file open. What you inspected there (audio_signal, length -> logprobs + encoder_output, no cache tensors) is actually model_with_encoder.onnx, not model.onnx. The cache-aware graph has 5 inputs/outputs, including the cache_* tensors. The repo ships two graphs of the same weights:
- model.onnx / model.q8.onnx: cache-aware streaming. Inputs: audio_signal, length, cache_last_channel, cache_last_time, cache_last_channel_len. Outputs: logprobs, encoded_lengths, cache_*_next. This is the one for live word-by-word use: carry the cache tensors across chunks. It does not emit encoder_output.
- model_with_encoder.onnx / .q8: full-context (no cache). Inputs: audio_signal, length. Outputs: logprobs + encoder_output. This is the one for record-then-tajweed: you have the full clip, run it once, feed encoder_output to the pronunciation head.
So directly: that file is right for the tajweed / record-and-stop path, but it cannot do real-time word-by-word (no cache means it re-processes the whole buffer every step). For live streaming use model.q8.onnx.
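The "carry the cache tensors across chunks" pattern can be sketched roughly like this. It is only an illustration of the loop structure, not code from the repo: `run_step` is a hypothetical stand-in for one forward pass of model.q8.onnx (for example a thin wrapper around an ONNX Runtime session), `fake_step` is a toy replacement for the model, and the real initial cache values should come from the README.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Cache input names as listed for the cache-aware graph in this thread.
CACHE_NAMES = ("cache_last_channel", "cache_last_time", "cache_last_channel_len")

# Hypothetical signature: (chunk_features, cache) -> (logprobs, next_cache).
StepFn = Callable[[Any, Dict[str, Any]], Tuple[Any, Dict[str, Any]]]


def stream_chunks(chunks: Iterable[Any], run_step: StepFn,
                  initial_cache: Dict[str, Any]) -> List[Any]:
    """Run chunks in order, feeding each step's output cache into the next step."""
    missing = [n for n in CACHE_NAMES if n not in initial_cache]
    if missing:
        raise ValueError(f"initial cache is missing: {missing}")
    cache = dict(initial_cache)
    outputs = []
    for chunk in chunks:
        # The cache returned by this step replaces the one passed in.
        logprobs, cache = run_step(chunk, cache)
        outputs.append(logprobs)
    return outputs


def fake_step(chunk, cache):
    """Toy stand-in for the model: counts how many chunks the cache has seen."""
    seen = cache["cache_last_channel_len"] + 1
    next_cache = dict(cache, cache_last_channel_len=seen)
    return f"chunk {chunk} (step {seen})", next_cache


demo = stream_chunks(["a", "b", "c"], fake_step,
                     {"cache_last_channel": 0, "cache_last_time": 0,
                      "cache_last_channel_len": 0})
print(demo)
```

In the app, `run_step` would map the `cache_*_next` outputs back onto the matching `cache_*` inputs before the next chunk; nothing from earlier chunks is re-decoded.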
On using one model for both: right now they are split on purpose (live text needs cache but not features, tajweed needs features but the recording is already finished so no cache). Since you want one model that streams and also exposes encoder_output for live tajweed, we are currently training the encoder features with streaming and will update the repo with a combined cache-aware + encoder_output version insha'Allah.
One gotcha for Flutter: audio_signal is 80-dim log-mel, not raw audio. See streaming_inference_example.py for the exact mel + CMVN params (n_fft 512, win 400, hop 160, 80 mels, ln(x + 2^-24), then the fixed CMVN from streaming_global_cmvn.npz). Raw PCM in gives garbage out.
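As a rough sketch of the last two steps only (the log guard and the CMVN), assuming the mel power values are already computed: the STFT and mel filterbank are deliberately left out and should be taken from streaming_inference_example.py. The per-bin `mean` / `std` layout is an assumption for illustration; the actual array names and contents of streaming_global_cmvn.npz are not shown in this thread.

```python
import math
from typing import List, Sequence

LOG_GUARD = 2.0 ** -24  # the ln(x + 2^-24) guard quoted above


def log_mel_with_cmvn(mel_power: Sequence[Sequence[float]],
                      mean: Sequence[float],
                      std: Sequence[float]) -> List[List[float]]:
    """mel_power is [n_mels][T]; returns normalized log-mel of the same shape.

    Assumes CMVN is one (mean, std) pair per mel bin, which is a guess here.
    """
    if not (len(mel_power) == len(mean) == len(std)):
        raise ValueError("mean/std must have one entry per mel bin")
    out = []
    for row, m, s in zip(mel_power, mean, std):
        out.append([(math.log(x + LOG_GUARD) - m) / s for x in row])
    return out


# Toy 2-bin, 3-frame input (the real model expects 80 bins); values are made up.
features = log_mel_with_cmvn([[0.0, 1.0, 4.0], [1.0, 1.0, 1.0]],
                             mean=[0.0, 0.0], std=[1.0, 1.0])
print(features[0][0])  # silence maps to ln(2^-24) instead of -inf
```

The guard is why an all-zero frame stays finite; the result per clip is then laid out as [B, 80, T_in] for audio_signal.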
May Allah accept from us and you <3
Update brother, it is live now <3
I added model_streaming_with_encoder.onnx (+ q8) to the repo: cache-aware streaming AND it returns encoder_output in the same pass. So one model does live word-by-word ASR and live tajweed together (logprobs + encoder_output + next-step cache out of a single forward). That is the one to use if you want pronunciation feedback while reciting.
I also benchmarked whether the streaming encoder hurts tajweed scoring vs the offline encoder, since that was the worry. Short answer: it barely changes. The ASR text is worse on streaming (about 4% WER offline vs 12% streaming, because text needs long context), but pronunciation detection is essentially the same, because it is a local judgement per letter:
| Encoder feeding the head | Pron. AUC | TPR @ 1% FPR | TPR @ 5% FPR |
|---|---|---|---|
| Offline (full context) | 0.980 | 92.7% | 94.8% |
| Streaming [70,13] | 0.984 | 92.3% | 96.5% |
So you can score tajweed live off the streaming encoder with basically no loss versus offline.
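For anyone unsure how to read the "TPR @ x% FPR" columns, here is one common way such a number is computed. It is a sketch of the metric's definition with made-up toy scores, not the benchmark script behind the table; which class counts as positive and the exact thresholding convention used there are assumptions.

```python
from typing import Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               max_fpr: float) -> float:
    """Best true-positive rate over thresholds whose false-positive rate <= max_fpr.

    labels: 1 = positive (assumed here: mispronounced), 0 = negative.
    A sample is flagged when score >= threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    best = 0.0
    # Candidate thresholds: every observed score, plus one above the maximum.
    for thr in sorted(set(scores)) + [max(scores) + 1.0]:
        fpr = sum(s >= thr for s in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(s >= thr for s in pos) / len(pos)
            best = max(best, tpr)
    return best


# Made-up toy scores, only to exercise the function.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
strict = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
relaxed = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.2)
print(strict, relaxed)
```

Allowing more false alarms can only keep or raise the TPR, which is why the 5% column is never below the 1% column.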
Quick guide:
- model.q8.onnx -> live text only (lightest)
- model_streaming_with_encoder.q8.onnx -> live text + live tajweed in one pass
- model_with_encoder.q8.onnx -> tajweed after the user stops (full clip, slightly higher text accuracy)
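To avoid opening the wrong file again, a small check on the tensor names can tell the three exports apart. This is a sketch based only on the names mentioned in this thread; `describe_graph` is a hypothetical helper, and it assumes the combined model uses the same `cache_*` input names as model.onnx.

```python
from typing import Iterable


def describe_graph(input_names: Iterable[str], output_names: Iterable[str]) -> str:
    """Guess which export a graph is from its input/output tensor names."""
    has_cache = any(name.startswith("cache_") for name in input_names)
    has_encoder = "encoder_output" in set(output_names)
    if has_cache and has_encoder:
        return "live text + live tajweed (model_streaming_with_encoder)"
    if has_cache:
        return "live text only (model)"
    if has_encoder:
        return "tajweed after recording (model_with_encoder)"
    return "unknown"


# With ONNX Runtime the names could be read like this (not executed here):
#   sess = onnxruntime.InferenceSession("some_model.onnx")
#   ins = [i.name for i in sess.get_inputs()]
#   outs = [o.name for o in sess.get_outputs()]

# Names from the inspection output posted earlier in the thread.
inspected = describe_graph(["audio_signal", "length"],
                           ["logprobs", "encoder_output"])
print(inspected)
```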
The README has the full I/O and cache init. Keep the questions coming, may Allah accept <3
May Allah accept <3
I will also make a local segmenter using the offline model, insha'Allah <3, for an app called Quran Caption (which someone made for the sake of Allah). You can follow it here:
https://github.com/zonetecde/qurancaption
Thank you for caring and for the fast responses <3