ordered_quran_phonemes.json (phoneme words mapped to each ayah's text, in order, keyed by surah and ayah number)
ربنا تقبل منا انك انت السميع العليم (Our Lord, accept this from us; You are the All-Hearing, the All-Knowing.)
Exported using obadx/quran-transcript.
This is V1. You can run a simple script to check for missing or connected words (I skipped this for now), but the ayahs are complete, alhamdulillah.
The only issue you may run into is connected words, but I patched it, alhamdulillah.
Example:
"2:6": {
"aya_text": "إِنَّ ٱلَّذِينَ كَفَرُوا۟ سَوَآءٌ عَلَيْهِمْ ءَأَنذَرْتَهُمْ أَمْ لَمْ تُنذِرْهُمْ لَا يُؤْمِنُونَ",
"aya_phonemes_list": [
"ءِننننَ",
"للَذِۦۦنَ",
"كَفَرُۥۥ",
"سَوَااااءُن",
"عَلَيهِم",
"ءَءَںںںذَرتَهُم",
"ءَم",
"لَم",
"تُںںںذِرهُم",
"لَاا",
"يُءمِنُۥۥن"
],
"aya_phoneme": "ءِننننَ للَذِۦۦنَ كَفَرُۥۥ سَوَااااءُن عَلَيهِم ءَءَںںںذَرتَهُم ءَم لَم تُںںںذِرهُم لَاا يُءمِنُۥۥن"
}
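The check for missing or connected words mentioned above could look roughly like the sketch below. It assumes the file is a single JSON object keyed by "surah:aya" with the fields shown in the example; the function names are hypothetical. It only compares whitespace-separated word counts, so a flagged ayah is a candidate for review, not a confirmed error (standalone pause marks in aya_text, for instance, would also shift the count).

```python
import json


def load_entries(path):
    """Load the phoneme file, assumed to be one JSON object keyed by 'surah:aya'."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_word_count_mismatches(entries):
    """Return {key: (text_word_count, phoneme_word_count)} where the two counts differ."""
    mismatches = {}
    for key, entry in entries.items():
        n_text = len(entry["aya_text"].split())
        n_phonemes = len(entry["aya_phonemes_list"])
        if n_text != n_phonemes:
            mismatches[key] = (n_text, n_phonemes)
    return mismatches
```

Typical use would be `find_word_count_mismatches(load_entries("ordered_quran_phonemes.json"))`, then reviewing the returned keys by hand.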
Jazak Allahu khayran for this. The idea and the surah:aya structure are exactly right, and quran-transcript is the correct source.
I went to wire it in and verified the phonemes against the model's actual output scheme, and found one issue worth explaining so the matching works reliably: the per-word aya_phonemes_list was phonemized word-in-isolation, which gives roughly 17.5% CER against this model's output even on a perfect recitation. In isolation, each word has its wasl-alef realized as a full hamza (ءَللَااه instead of the elided للَااه), loses its connecting i'rab vowel, and uses a shorter madd. The model was trained on connected recitation, so it never emits those isolated forms.
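For anyone reproducing a CER figure like the one above, here is a minimal sketch. It assumes CER means character-level Levenshtein distance divided by the reference length, and that every phoneme symbol is a single Unicode code point; the thread does not state exactly how the number was computed, so treat both as assumptions. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under these assumptions, an isolated form that carries an extra leading hamza and vowel costs two edits for that word alone, which is how small per-word differences can add up across an ayah.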
So I generated a canonical ordered_quran_phonemes.json (now on main), phonemized as connected recitation with the exact MoshafAttributes the model was trained on (hafs, madd 4/4/4/4). I verified it end to end: it covers all 6236 ayat, contains 0 symbols outside the model vocab, and the per-word list joins back to the full string. It matches the deployed retrieval reference verbatim on 2646 ayat (the rest differ only because that file is segment-windowed). Mean CER against the muaalem gold is 2.8%, which comes purely from the madd convention and is expected.
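The structural checks listed above (ayah count, join-back, vocabulary) can be sketched as follows. This is not the script that was actually run; `validate_entries` is a hypothetical name, the vocabulary must be supplied by the caller, and the sketch assumes words are joined with a single space that is not itself a vocab symbol.

```python
def validate_entries(entries, vocab, expected_count=6236):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    vocab = set(vocab)
    problems = []
    if len(entries) != expected_count:
        problems.append(f"expected {expected_count} ayat, found {len(entries)}")
    for key, entry in entries.items():
        # The per-word list should reconstruct the full string exactly.
        if " ".join(entry["aya_phonemes_list"]) != entry["aya_phoneme"]:
            problems.append(f"{key}: word list does not join back to aya_phoneme")
        # Every non-space symbol should be one the model can emit.
        unknown = set(entry["aya_phoneme"]) - vocab - {" "}
        if unknown:
            problems.append(f"{key}: symbols outside vocab: {sorted(unknown)}")
    return problems
```

Returning a list instead of raising on the first failure makes it easy to see every affected ayah in one pass.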
One important note for whoever builds the matcher: phonetic words are not 1:1 with orthographic words (idgham and wasl merge adjacent words, and the muqatta'at like الم expand into several phonetic tokens), so do sequence alignment over the full aya_phoneme string rather than mapping word index to word index. That is also the more robust approach for mistake detection.
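A minimal sketch of that alignment, using Python's standard difflib for brevity: it aligns the recognized string against the full aya_phoneme reference and reports each differing span together with the index of the phonetic word it falls in. The function name is hypothetical, and difflib's matcher is not a minimal-edit-distance aligner, so a Levenshtein or weighted alignment may suit a production matcher better.

```python
from difflib import SequenceMatcher


def find_mistakes(reference, hypothesis):
    """Align two phoneme strings and list the spans where they differ.

    Each result is (tag, word_index, ref_segment, hyp_segment), where tag is
    'replace', 'delete', or 'insert' and word_index is the position of the
    affected phonetic word in the space-separated reference.
    """
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    mistakes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Count the spaces before the span to recover the phonetic word index.
        word_index = reference.count(" ", 0, i1)
        mistakes.append((tag, word_index, reference[i1:i2], hypothesis[j1:j2]))
    return mistakes
```

Because the word index is derived from the reference string itself, it refers to phonetic words and never needs to line up with the orthographic words in aya_text.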
Closing this in favor of the canonical version on main, but thank you, this was a genuinely useful push and the structure is what I kept.