conversion to onnx needs script folder

#2
by TheGreatQuran2026 - opened

As-salamu alaykum, brother <3
I really love you for doing this for the sake of Allah, the Greatest.
Our Lord, accept this from us; indeed You are the All-Hearing, the All-Knowing.

May Allah accept from us.
I tried to export to ONNX using the export script from the icefall repo (edited to fit this model), but it requires two files that are missing here: zipformer_rnnt_ctc_train.py and zipformer_rnnt_ctc_eval.py.
You can use this export script:

Export Script

Folder Structure
📁 zipformer_p-quran/ <-- (The Root Directory)

├── 📁 icefall/ <-- (Cloned k2-fsa/icefall repository)
├── 📁 scripts/ <-- [MISSING IN HF REPO] Needs zipformer_rnnt_ctc_train.py & eval.py

├── 📄 phoneme_units.json
├── 📄 quran_text2phoneme.json
├── 📄 quran_phoneme_zipformer.pt

└── 📄 export-onnx-streaming-ctc.py <-- (The script provided below)

Command to run

python export-onnx-streaming-ctc.py \
  --epoch 99 \
  --avg 1 \
  --exp-dir ./exp \
  --causal True \
  --chunk-size 24 \
  --left-context-frames 256 \
  --enable-int8-quantization 1

Wa ʿalaykum as-salām wa raḥmatullāh, brother 🤍 May our Lord accept from you and from us.

Done. I just added the two files to the repo under scripts/:

  • scripts/zipformer_rnnt_ctc_train.py (has build_model + the PhonemeUnitTokenizer)
  • scripts/zipformer_rnnt_ctc_eval.py (has greedy_ctc_decode)

One note on the checkpoint: quran_phoneme_zipformer.pt is a single file with the weights under the "model" key (not icefall's epoch-N/exp-dir layout). So instead of --epoch 99 --avg 1, build the model and load directly:

import torch
from scripts.zipformer_rnnt_ctc_train import build_model, load_tokenizer

tok = load_tokenizer('phoneme_units.json')
blank = tok.get_piece_size()  # blank id comes right after the last phoneme piece
vocab = blank + 1
model = build_model(vocab, blank, chunk_frames=[24], left_context_frames=256).eval()
ckpt = torch.load('quran_phoneme_zipformer.pt', map_location='cpu')
model.load_state_dict(ckpt['model'])  # weights are stored under the "model" key
# now trace model.encoder.forward / model.ctc_head for your streaming ONNX export

Use chunk_size=(24,), left_context_frames=(256,) for the 1000 ms streaming profile (it's causal, so smaller chunks work too). You still need k2-fsa/icefall cloned for the zipformer/scaling/subsampling modules build_model imports, exactly as in your folder layout. Let me know if the export still trips on anything 🤝

Update, brother 🤍 you no longer need to export it yourself, I did it up front and pushed the ONNX to the repo:

  • quran_phoneme_zipformer.onnx , fp32 (encoder + CTC, full-utterance), runs with plain onnxruntime, no PyTorch/icefall needed.
  • quran_phoneme_zipformer.int8.onnx , dynamic-INT8 (q8), ~75 MB (about 3.5x smaller), and it matches the fp32 output 100% on argmax.
  • scripts/export_quran_onnx.py , the exact export script if you want to reproduce or re-export.

Inputs: 80-bin kaldi fbank feats (B,T,80) + feat_lens; outputs log_probs (B,T',251) + out_lens (argmax then collapse for phonemes). Use B=1; the graph zero-pads the time axis internally so the icefall downsample traces cleanly (that was the bit your export was tripping on).

import onnxruntime as ort
s = ort.InferenceSession('quran_phoneme_zipformer.int8.onnx', providers=['CPUExecutionProvider'])
log_probs, out_lens = s.run(None, {'feats': feats, 'feat_lens': lens})
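
The "argmax then collapse" step mentioned above can be sketched in plain Python. This is only a minimal greedy CTC decode, assuming the blank id is 250 (vocab 251 minus one, following the blank = tok.get_piece_size() line earlier in the thread). ctc_greedy_ids is an illustrative name, not a function from the repo, and mapping ids back to phoneme strings is left to the tokenizer.

```python
from typing import List, Sequence

# Assumption: blank is the last class (251 outputs -> id 250), as in the
# checkpoint-loading snippet earlier in this thread.
BLANK_ID = 250


def ctc_greedy_ids(frames: Sequence[Sequence[float]], length: int,
                   blank_id: int = BLANK_ID) -> List[int]:
    """Greedy CTC: argmax per frame, merge consecutive repeats, drop blanks.

    `frames` is one utterance of shape (T', vocab); a NumPy array such as
    log_probs[0] works as well as a list of lists.
    """
    ids: List[int] = []
    prev = None
    for t in range(length):
        row = frames[t]
        best = max(range(len(row)), key=lambda k: row[k])  # argmax over classes
        if best != prev and best != blank_id:
            ids.append(best)
        prev = best
    return ids
```

With the session output above, a call could look like ctc_greedy_ids(log_probs[0], int(out_lens[0])), using out_lens so that padded frames are ignored.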

Jazak Allahu khayran (may Allah reward you with good) for the push, it made the repo better for everyone.

Alhamdulillah, and thanks for the fast response, my brother <3
I did nothing; alhamdulillah for Allah's guidance.
Thanks for this awesome model.

This is now the best model available these days, insha'Allah.

May Allah the Great accept me and you into Jannah in the hereafter <3

When I tried this ONNX and added metadata so it could run directly on sherpa-onnx, it hit an error.
When I traced it and analyzed the export script, I found that the ONNX had been exported as an offline (non-streaming) model.
Can you check, please?
Also, please refer to the export script below: I am trying to export it myself, but I am hitting k2 module errors on Windows and I don't have the bandwidth to download WSL.

Export Streaming Script
python export_streaming.py --exp-dir ./exp --enable-int8-quantization 1

Folder Structure
📁 zipformer_p-quran/ <-- (The Root Directory)

├── 📁 icefall/ <-- (Cloned k2-fsa/icefall repository)
├── 📁 scripts/

├── 📄 phoneme_units.json
├── 📄 quran_phoneme_zipformer.pt

└── 📄 export-streaming.py <-- (The script provided )

I'm still trying to fix the k2 module errors so I can export. If I can't, I will download WSL, insha'Allah.

Sorry for the custom, specific requests.

Fixed, brother 🤍 you were right, the first one was a full-utterance (offline) graph. I re-exported it properly as a cache-aware streaming zipformer2-CTC ONNX and replaced the files in the repo:

  • quran_phoneme_zipformer.onnx , streaming, chunk-by-chunk.
  • quran_phoneme_zipformer.int8.onnx , streaming int8 (~73 MB).

It is the standard sherpa-onnx online zipformer2-CTC format: input is one chunk x (1,T,80) + the encoder cache states (cached_key/nonlin_attn/val1/val2/conv1/conv2 per layer, embed_states, processed_lens), outputs log_probs + the new_* states for the next chunk. All the streaming params (model_type=zipformer2, decode_chunk_len=48, T=61, left_context_len, layer dims) are embedded as ONNX metadata, so sherpa-onnx reads them automatically , no manual config. Exported with icefall's own export-onnx-streaming-ctc path, so it should drop straight into the sherpa-onnx online recognizer.
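
To make the chunk parameters above concrete, here is a small sketch of how feature frames might be sliced for chunk-by-chunk decoding. It assumes the runtime feeds T=61 frames per step and advances by decode_chunk_len=48 frames; that reading of the two values, and the helper name chunk_windows, are assumptions for illustration and are not taken from the export script.

```python
from typing import Iterator, Tuple


def chunk_windows(num_frames: int, chunk_size: int = 61,
                  chunk_shift: int = 48) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame ranges for overlapping streaming chunks.

    chunk_size  : frames fed to the model per step (T in the metadata)
    chunk_shift : frames to advance between steps (decode_chunk_len)
    Frames left over at the end are not yielded; a real runtime would
    usually pad the tail before the final step.
    """
    start = 0
    while start + chunk_size <= num_frames:
        yield start, start + chunk_size
        start += chunk_shift


# Example: 200 feature frames give three full chunks.
for start, end in chunk_windows(200):
    print(start, end)
```

Each window would then be passed as x with shape (1, 61, 80) together with the cache states returned by the previous step.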

scripts/export_quran_streaming_onnx.py is in the repo if you want to see exactly how. Let me know if sherpa is happy with it now 🤝
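
If sherpa-onnx still complains, one quick check is whether the metadata keys named above are actually present in the file. The sketch below only tests for the four keys mentioned in this comment (the "layer dims" entries are not named here, so they are not checked), and missing_metadata_keys is a hypothetical helper, not part of the repo.

```python
from typing import Dict, List

# Keys named in the comment above; sherpa-onnx may read more than these.
EXPECTED_KEYS = ("model_type", "decode_chunk_len", "T", "left_context_len")


def missing_metadata_keys(meta: Dict[str, str]) -> List[str]:
    """Return the expected keys that are absent from an ONNX metadata map."""
    return [key for key in EXPECTED_KEYS if key not in meta]


# Usage with onnxruntime (needs the model file on disk):
#   import onnxruntime as ort
#   sess = ort.InferenceSession("quran_phoneme_zipformer.int8.onnx",
#                               providers=["CPUExecutionProvider"])
#   meta = sess.get_modelmeta().custom_metadata_map
#   print(missing_metadata_keys(meta))
```

An empty list means all four keys are present; it does not prove that their values are what the runtime expects.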

Thanks, my brother <3
I was fighting to fix k2 and didn't manage it, hahaha.
But alhamdulillah, you were faster than me.
I am testing now, insha'Allah.

The previous problem was a shift in tokens.txt (maybe; I fixed it but haven't tested on Android yet).
Testing on PC, q8 (int8):
الحمد لله رب العالمين

✅ Recording finished. Streaming audio to model in chunks...

[05.9s] Partial: ء َ ل ح َ م د ُ ل ِ ل ل َ ا ا ه ِ ر َ ب ب ِ ل ع َ ا ا ل َ م ِ ۦ ۦ ن

ءَلحَمدُلِللَااهِرَببِلعَاالَمِۦۦن
الحمد لله رب العالمين

Testing on PC, fp32:
✅ Recording finished. Streaming audio to model in chunks...

[05.9s] Partial: ء َ ل ح َ م د ُ ل ِ ل ل َ ا ا ه ِ ر َ ب ب ِ ل ع َ ا ا ل َ م ِ ۦ ۦ ن

ءَلحَمدُلِللَااهِرَببِلعَاالَمِۦۦن

Alhamdulillah.
Thanks, my brother <3
Sorry for the headache.

I'll share the Flutter project with you so you can make an iOS version and share it on your account (if you want), insha'Allah.

Alhamdulillah, praise be to Allah the Greatest.

You are lucky, my brother, that God chose you to do this work.

This is the instant output:
I/flutter ( 6956): [ASR] ⚡ 216ms | "بِسمِللَ" | final=false
I/flutter ( 6956): [ASR] ⚡ 65ms | "بِسمِللَ" | final=false
I/flutter ( 6956): [ASR] ⚡ 3ms | "بِسمِللَ" | final=false
I/flutter ( 6956): [ASR] ⚡ 213ms | "بِسمِللَااهِ" | final=false
I/flutter ( 6956): [ASR] ⚡ 58ms | "بِسمِللَااهِ" | final=false
I/flutter ( 6956): [ASR] ⚡ 4ms | "بِسمِللَااهِ" | final=false
I/flutter ( 6956): [ASR] ⚡ 213ms | "بِسمِللَااهِررَحمَاا" | final=false
I/flutter ( 6956): [ASR] ⚡ 61ms | "بِسمِللَااهِررَحمَاا" | final=false
I/flutter ( 6956): [ASR] ⚡ 5ms | "بِسمِللَااهِررَحمَاا" | final=false
I/flutter ( 6956): [ASR] ⚡ 212ms | "بِسمِللَااهِررَحمَاانِررَ" | final=false
I/flutter ( 6956): [ASR] ⚡ 61ms | "بِسمِللَااهِررَحمَاانِررَ" | final=false
I/flutter ( 6956): [ASR] ⚡ 4ms | "بِسمِللَااهِررَحمَاانِررَ" | final=false
I/flutter ( 6956): [ASR] ⚡ 233ms | "بِسمِللَااهِررَحمَاانِررَحِ" | final=false
I/flutter ( 6956): [ASR] ⚡ 81ms | "بِسمِللَااهِررَحمَاانِررَحِ" | final=false
I/flutter ( 6956): [ASR] ⚡ 3ms | "بِسمِللَااهِررَحمَاانِررَحِ" | final=false
I/flutter ( 6956): [ASR] ⚡ 211ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 60ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [AUDIO] 🔇 VAD OFF
I/flutter ( 6956): [ASR] ⚡ 12ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [AUDIO] 🎤 VAD ON (Recovered 18 pre-roll frames)
I/flutter ( 6956): [ASR] ⚡ 220ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 64ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 176ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 18ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 3ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 221ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 69ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 5ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [AUDIO] 🔇 VAD OFF
I/flutter ( 6956): [ASR] ⚡ 8ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [AUDIO] 🎤 VAD ON (Recovered 18 pre-roll frames)
I/flutter ( 6956): [ASR] ⚡ 212ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 241ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 88ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم" | final=false
I/flutter ( 6956): [ASR] ⚡ 3ms | "بِسمِللَااهِررَحمَاانِررَحِۦۦم"

Alhamdulilah, and thank you for your kind words.

Can you tell me if this feels faster/more accurate than fastconformer and if so by how much?

I will run tests on both, insha'Allah.
So far I have only run the streaming FastConformer model on onnxruntime (which is of course slower than sherpa-onnx).

But because of the minimum chunk size of the streaming FastConformer, it was much slower than this model.
This model is almost instant.
Here the latency is between 5 ms and 300 ms at most on my old Android CPU, while FastConformer started from 500 ms (there was a noticeable delay).
Accuracy:
I am still testing this, but the streaming FastConformer never recognized words like الم
and it sometimes hallucinated.

The streaming model also had a forgetting problem: once the context grew to 10-20 words, or about 1 minute without clearing the output, it started producing random output and dropping a lot of letters.
