Feedback & Insights

#2
by hussam002 - opened

Hi Mostafa,

I hope you are doing well.

First of all, I want to thank you so much for your great work on the model. Developing assets for Quranic computing is deeply rewarding, and may Allah reward you with the best for this noble effort.

I am currently building an offline-first Android application designed to help users recite and memorize the Holy Quran. Over the past two weeks, I have been thoroughly testing your model as part of our speech recognition stack and wanted to share some practical feedback and technical insights from my implementation:

  • Offline vs. Streaming: The model currently works in offline mode (batch processing), but for a real-time recitation tracking experience, having streaming support would be a massive upgrade.
  • Auto-Correction / Over-Smoothing: The model tends to automatically correct minor pronunciation or diacritic errors. For instance, if a user recites "مَالِكُ" (with a Dammah) instead of "مَالِكِ" (with a Kasrah), the model still transcribes it as "مَالِكِ". For a Quran correction/memorization app, detecting these exact phonetic or diacritic mistakes is crucial.
  • Noise & Low-Quality Mics Performance: When testing on budget mobile devices, poor quality microphones, or in noisy environments, the accuracy drops significantly.
  • Hallucinations: I noticed some unusual hallucinations in noisy environments. For example, background whistling or static noise occasionally gets transcribed into random phrases like "انتهى الدرس" (the lesson is over).
  • Mobile Optimization: To make the model viable for standard Android hardware, I successfully quantized it to INT8, which brought it down to an acceptable size and performance footprint for mobile storage.

I wanted to share these observations with you in hopes that they might assist you in future iterations or fine-tuning of the model.

Thank you again for your valuable contribution, and looking forward to seeing how your project evolves!

I'm not the author of the model, but I have used it, alhamdulillah.
Auto-correction is normal behaviour for a FastConformer model.

This model handles noise and low-quality audio much better than other models.
Hallucinating انتهى الدرس during silence is normal.

If you would like to drop your project and contribute to mine on GitHub, I can invite you to the repo.

But there is one strict rule in my repo: "Never take purchases, never take donations, never sell apps, never take money... Only for the sake of Allah" <3

My app is already on the Play Store if you want to test it. It uses a FastConformer clean-text model, and its matching was a bit buggy; alhamdulillah, I have fixed those bugs, but the update is not on the Play Store yet.
https://recitequran.pages.dev/

Step 1: Join the Early Access Group
https://groups.google.com/g/recite-quran

Step 2: Download the App from Google Play
https://play.google.com/store/apps/details?id=com.recitequran.app

Wa alaikum as-salam wa rahmatullahi wa barakatuh, and jazak Allah khayran for the careful testing and for the niyyah behind your app.

This is genuinely useful feedback. Briefly on each point:

  • Streaming: agreed, this is the big one. I am training a proper cache-aware streaming model from scratch right now (not a patched offline model), and will release it as ONNX with INT8/q8 for mobile in sha Allah.
  • Noise and low-quality mics: I hear you. The new streaming model is being trained with heavy real-world augmentation (noise, gain, distortion) plus real phone-mic recitation audio, specifically to hold up better on budget devices and noisy rooms.
  • Auto-correction (مَالِكُ vs مَالِكِ): important to be clear, this is not specific to my model, it is inherent to all ASR systems. Any speech-to-text model trained on correct transcripts learns to map audio to the most likely valid text, so it normalizes away small pronunciation and diacritic deviations and snaps to the canonical form. It is even stronger in a closed domain like the Quran where the text is fixed, and stronger in decoders with a heavy internal language model (attention and RNN-T more than CTC). So no ASR, mine or anyone else's, will reliably flag a wrong harakah on its own. It helps to separate two tiers of mistake detection. The easier tier is word-level: missed, skipped or wrong words, which you catch by aligning the ASR transcript against the expected text. That is what Tarteel's Memorization Mistake Detection does. Your مَالِكُ vs مَالِكِ case is the harder tier: the word is correct, only the harakah is wrong, and word-level comparison cannot catch it because the ASR has already auto-corrected the transcript to the right word. For that you need a phoneme or acoustic level layer that compares what was actually pronounced against what is expected, which is what a pronunciation/mispronunciation head (it is in the repo) is for, and what I am training a per-phoneme tajweed model to do.
  • Hallucinations in silence/noise (انتهى الدرس): classic non-speech hallucination. The most reliable fix is a VAD (voice-activity) gate in front of the model so it only transcribes actual speech, plus a confidence/blank threshold to drop low-confidence output. The augmentation training above reduces it too.
  • INT8: nice work getting it to mobile size, that matches my plan for the streaming release.
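
To make the word-level tier concrete, here is a minimal sketch that aligns a transcript against the expected text using Python's standard `difflib`. The function names `strip_diacritics` and `word_level_mistakes` are made up for this illustration; this is not the code from the repo or from Tarteel. Harakat are stripped first on purpose, which also shows why this tier cannot see a wrong case ending.

```python
import difflib
import unicodedata


def strip_diacritics(word: str) -> str:
    """Drop combining marks (Arabic harakat included) so only base letters remain."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def word_level_mistakes(expected: str, transcript: str):
    """Align two word sequences and report skipped, extra, and wrong words."""
    exp = [strip_diacritics(w) for w in expected.split()]
    got = [strip_diacritics(w) for w in transcript.split()]
    matcher = difflib.SequenceMatcher(a=exp, b=got, autojunk=False)
    mistakes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":      # expected words missing from the recitation
            mistakes.append(("skipped", exp[i1:i2], []))
        elif tag == "insert":    # recited words that are not in the expected text
            mistakes.append(("extra", [], got[j1:j2]))
        elif tag == "replace":   # expected words replaced by different words
            mistakes.append(("wrong", exp[i1:i2], got[j1:j2]))
    return mistakes


if __name__ == "__main__":
    # The middle word is skipped; the wrong harakah on the first word goes unnoticed.
    print(word_level_mistakes("مَالِكِ يَوْمِ الدِّينِ", "مَالِكُ الدِّينِ"))
```

With this input the only reported mistake is the skipped middle word. The dammah on the first word produces nothing, because at word level the two forms are the same word.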

The catch is data. A phoneme/acoustic layer only learns to report an error if it has actually seen that error, labelled as such, many times in training. If you train it on clean recitation alone, it has no signal for what a wrong harakah sounds like and will do exactly what the ASR does: default to the canonical mapping and snap مَالِكُ back to مَالِكِ. So this layer needs a large corpus of incorrect recitation covering the full space of deviations, realistically on the order of 500+ hours spanning all the common error variants (each wrong case ending, each substituted, elongated or shortened vowel, each missed ghunnah or madd), with annotations of what was actually pronounced rather than the canonical text. Without that breadth and volume, coverage is too sparse: the model sees a given error type a handful of times, can't separate it from noise, and falls back to the safe canonical output. The faithful-decoding architecture is necessary but not sufficient. It is what lets the error survive into the output; the labelled error data is what teaches the model to recognize and name it.
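
As a rough illustration of what "annotations of what was actually pronounced" can give you, the sketch below turns one annotated pair (canonical word, pronounced word) into per-letter harakah error labels. It assumes the annotation is written as diacritized text; the helper names and the label fields are hypothetical and not a format used by any existing dataset.

```python
import unicodedata


def split_letters(word: str):
    """Group each base letter with the combining marks (harakat) that follow it."""
    units = []
    for ch in unicodedata.normalize("NFC", word):
        if unicodedata.combining(ch) and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)
        else:
            units.append((ch, ""))
    return units


def harakah_labels(canonical: str, pronounced: str):
    """List the letters whose harakat differ between canonical and pronounced forms."""
    c_units = split_letters(canonical)
    p_units = split_letters(pronounced)
    if [b for b, _ in c_units] != [b for b, _ in p_units]:
        # Different base letters means a word-level error, handled by the other tier.
        raise ValueError("base letters differ; not a harakah-only error")
    return [
        {"index": i, "letter": base, "expected": c_marks, "pronounced": p_marks}
        for i, ((base, c_marks), (_, p_marks)) in enumerate(zip(c_units, p_units))
        if c_marks != p_marks
    ]


if __name__ == "__main__":
    # Canonical kasrah on the last letter, recited with a dammah.
    print(harakah_labels("مَالِكِ", "مَالِكُ"))
```

For this pair the result is a single label on the last letter, with the kasrah as expected and the dammah as pronounced. A label like this is only as good as the annotation behind it, which is the data problem described above.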

Jazak Allah khayran again, and feel free to reach out anytime. May Allah make your app a means of khayr for the Ummah.

May Allah, our Lord, accept.
This is complex work, my brother <3

Jazak Allah khayran to both of you for your time, valuable feedback, and beautiful intentions. May Allah reward you both with the best in this life and the hereafter.

@TheGreatQuran2026
Jazak Allah khayran for the kind invitation and for sharing your project. I have installed and tested your app, and it is truly impressive, may Allah bless your efforts. My own app aims to provide a similar experience to Tarteel, operating within available constraints: not as a competitor, but as a humble addition to this space. Like yours, my app is completely free, ad-free, and dedicated solely for the sake of Allah.

@Muno459
Jazak Allah khayran for your detailed explanation and for your ongoing dedication. I am well aware that standard ASR models tend to snap to canonical forms when untrained for specific errors; your insight on the necessity of phoneme-level training and large-scale error corpora is spot on. For my current implementation, I am successfully utilizing a VAD, which has significantly mitigated the hallucination issues. I truly appreciate your transparency regarding the data challenge, and I pray that Allah grants you the strength, resources, and success in completing this model for the benefit of the Ummah. I will be eagerly awaiting the release of your new streaming model, in sha Allah.
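
For anyone following the thread, a VAD gate can be as simple as a frame-energy threshold. The sketch below is only an illustration under assumed values (mono float samples in [-1, 1], 320-sample frames, which is 20 ms at 16 kHz, and an RMS threshold of 0.02); it is not the VAD used in either app, and a trained VAD model will usually hold up better in noise.

```python
import math


def frame_rms(samples, frame_len):
    """Yield the RMS energy of each full, non-overlapping frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        yield math.sqrt(sum(s * s for s in frame) / frame_len)


def speech_segments(samples, frame_len=320, threshold=0.02, min_frames=3):
    """Return (start, end) sample ranges that stay above threshold for min_frames or more frames."""
    segments = []
    run_start = None  # index of the first frame in the current loud run
    n_frames = 0
    for i, rms in enumerate(frame_rms(samples, frame_len)):
        n_frames = i + 1
        if rms >= threshold:
            if run_start is None:
                run_start = i
        else:
            # A loud run just ended; keep it only if it was long enough.
            if run_start is not None and i - run_start >= min_frames:
                segments.append((run_start * frame_len, i * frame_len))
            run_start = None
    # Close a run that lasts until the end of the audio.
    if run_start is not None and n_frames - run_start >= min_frames:
        segments.append((run_start * frame_len, n_frames * frame_len))
    return segments


if __name__ == "__main__":
    audio = [0.0] * 640 + [0.5] * 1600 + [0.0] * 640
    for start, end in speech_segments(audio):
        print("send samples", start, "to", end, "to the ASR model")
```

Only the returned ranges would be passed to the recognizer, so silence and short clicks never reach it. The `min_frames` check is what drops very short bursts.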

Don't forget to say: "Our Lord, accept from us. Indeed You are the Hearing, the Knowing."
ربنا تقبل منا انك انت السميع العليم
Also, if you send me your GitHub, I can invite you in case you want to use anything from the repo in your app, in sha Allah <3
Masha Allah, this model will change a lot of things (streaming will remove the sliding window that drives up CPU usage, and it will help with tashkeel checking and tajweed checking), alhamdulillah.

Thank you my brothers <3
