Shenava — Rizeh-Pizeh v1.0 (6.9M) · cache-aware streaming · native-Rust (tract)
Cache-aware streaming CTC export of Shenava-Rizeh-Pizeh-v1.0
that runs in the pure-Rust tract engine — no C++, no ONNX Runtime.
Part of VisualEars / Shenava: offline, on-device, streaming Persian ASR for the Deaf/Hard-of-Hearing.
Quality: intelligible (24.55% golden-6669 WER); real-time on a 2015 Cortex-A7. RTF ≈ 0.018 (20.3 ms/chunk on x86 CPU; chunk = 1.12 s audio).
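As a quick check on the figures above, the real-time factor is compute time divided by the audio duration it covers. The sketch below only redoes that arithmetic. The 10 ms hop per mel frame is an assumption (the usual NeMo featurizer stride, not stated in this card), used to relate the 112-frame shift to 1.12 s of audio.

```python
# Real-time factor from the numbers quoted in this card.
COMPUTE_MS = 20.3        # per-chunk compute time on x86 CPU (from the card)
CHUNK_AUDIO_S = 1.12     # audio covered by one chunk (from the card)

# Assumption: 10 ms hop per mel frame, so a 112-frame shift is 1.12 s.
HOP_S = 0.01
SHIFT_FRAMES = 112


def real_time_factor(compute_ms: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return (compute_ms / 1000.0) / audio_s


rtf = real_time_factor(COMPUTE_MS, CHUNK_AUDIO_S)
chunk_s_from_frames = SHIFT_FRAMES * HOP_S

print(f"RTF = {rtf:.4f}")
print(f"chunk duration from frames = {chunk_s_from_frames:.2f} s")
```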
⚠️ Requires patched tract (only on builds that predate the upstream fix)
Tract builds without the streaming fixes reject NeMo cache-aware streaming graphs in two inference-layer spots. The fix is a 23-line, 2-file patch
(shenava_tract_streaming.patch, included), submitted upstream as sonos/tract#2441.
Build tract with the patch, then load model.onnx normally. The graph itself is valid (identical decode to ONNX Runtime).
Streaming contract
Per-step inputs / outputs (fixed shapes, greedy CTC):
- audio_signal [1,80,121]: un-normalized log-mel chunk (NeMo featurizer, normalize=NA)
- length [1] i64: true valid frames in the chunk
- cache_last_channel [1,12,70,144], cache_last_time [1,12,144,8], cache_last_channel_len [1] i64: start as zeros / 0
- → logprobs [1,T',1025] + next caches
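The input contract above can be written down as a small table of shapes, with the caches zero-initialized for the first step. This is a minimal sketch in plain Python: the flat-list buffers and the helper names (`INPUT_SPECS`, `zero_buffer`, `initial_caches`) are illustrative, and the f32 dtype for the audio and cache tensors is an assumption based on the fp32 reference model. A real client would hand these to its tensor library instead.

```python
from math import prod
from typing import Dict, List, Tuple

# name -> (shape, dtype). f32 entries are an assumption; i64 comes from the card.
INPUT_SPECS: Dict[str, Tuple[Tuple[int, ...], str]] = {
    "audio_signal": ((1, 80, 121), "f32"),
    "length": ((1,), "i64"),
    "cache_last_channel": ((1, 12, 70, 144), "f32"),
    "cache_last_time": ((1, 12, 144, 8), "f32"),
    "cache_last_channel_len": ((1,), "i64"),
}


def zero_buffer(name: str) -> List[float]:
    """Return a flat, zero-filled buffer sized for the named input."""
    shape, dtype = INPUT_SPECS[name]
    zero = 0 if dtype == "i64" else 0.0
    return [zero] * prod(shape)


def initial_caches() -> Dict[str, List[float]]:
    """Caches for the very first step: all zeros, cache length 0."""
    return {n: zero_buffer(n) for n in INPUT_SPECS if n.startswith("cache_")}


def check_size(name: str, flat: List[float]) -> None:
    """Raise if a flat buffer does not match the fixed shape for this input."""
    expected = prod(INPUT_SPECS[name][0])
    if len(flat) != expected:
        raise ValueError(f"{name}: expected {expected} elements, got {len(flat)}")
```

Calling `check_size` on each buffer before a step is a cheap way to catch a chunk that was not padded to 121 frames.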
Chunking: feed 121-mel-frame chunks with a shift of 112 frames (9-frame pre-encode overlap). The first chunk is 105 frames, so pad it to 121; pad the final chunk too, and pass the true length in both cases. Thread the *_next caches back in at each step (cast cache_last_channel_len_next to i64). Greedy CTC: carry the previous token across chunk boundaries when collapsing repeats; blank id = 1024; map ids via tokens.txt; replace ▁ with a space.
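The chunking and decoding rules above can be sketched as follows. This is an illustration under stated assumptions, not the reference client. `plan_chunks` assumes the NeMo-style layout in which chunk 0 covers frames [0, 105) and every later chunk starts 9 frames before the previous chunk's end. `StreamingGreedyCTC` assumes the caller has already taken the per-frame argmax over the 1025 log-probs and loaded tokens.txt into a list indexed by token id. All names here are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

CHUNK_FRAMES = 121                      # fixed audio_signal width per step
SHIFT_FRAMES = 112                      # new frames consumed per later step
FIRST_FRAMES = 105                      # valid frames in the first chunk
OVERLAP = CHUNK_FRAMES - SHIFT_FRAMES   # 9-frame pre-encode overlap
BLANK_ID = 1024


def plan_chunks(total_frames: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (start_frame, valid_len, pad_len) for each streaming step.

    Assumed layout: chunk 0 is frames [0, 105); later chunks re-read the
    last 9 frames of the previous chunk and add up to 112 new ones.
    """
    if total_frames <= 0:
        return
    start, end = 0, min(FIRST_FRAMES, total_frames)
    while True:
        valid = end - start
        yield start, valid, CHUNK_FRAMES - valid   # pad up to 121 frames
        if end >= total_frames:
            return
        start = end - OVERLAP
        end = min(end + SHIFT_FRAMES, total_frames)


def argmax_ids(logprobs: List[List[float]]) -> List[int]:
    """Greedy pick: index of the highest log-prob in each output frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in logprobs]


class StreamingGreedyCTC:
    """Collapse repeats and drop blanks, carrying state across chunks."""

    def __init__(self, tokens: List[str], blank_id: int = BLANK_ID) -> None:
        self.tokens = tokens
        self.blank_id = blank_id
        self.prev: Optional[int] = None   # last frame's id, kept between chunks
        self.pieces: List[str] = []

    def feed(self, frame_ids: List[int]) -> None:
        for tid in frame_ids:
            # Emit only on a change to a non-blank id.
            if tid != self.blank_id and tid != self.prev:
                self.pieces.append(self.tokens[tid])
            self.prev = tid

    def text(self) -> str:
        # SentencePiece word-boundary marker (U+2581) becomes a space.
        return "".join(self.pieces).replace("\u2581", " ").strip()
```

Because `prev` survives between `feed` calls, a token whose frames straddle a chunk boundary is emitted once rather than twice. Resetting `prev` per chunk would duplicate such tokens.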
Numbers are spoken-form → ITN
The model spells out numbers (هشت, not ۸). Apply persian_itn.py at display time to convert spoken forms to Persian digits (cardinals, هزار/میلیون/میلیارد, «و», and compounds).
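To make the spoken-to-digit step concrete, here is a toy cardinal converter. It is not persian_itn.py and does not reflect how that script works. It is a minimal sketch that handles a whole phrase consisting only of number words from a deliberately small vocabulary, joined by «و». The names `VALUES`, `SCALES`, and `spoken_to_digits` are hypothetical.

```python
# Partial vocabulary, for illustration only.
VALUES = {
    "صفر": 0, "یک": 1, "دو": 2, "سه": 3, "چهار": 4, "پنج": 5,
    "شش": 6, "هفت": 7, "هشت": 8, "نه": 9, "ده": 10,
    "بیست": 20, "سی": 30, "چهل": 40, "پنجاه": 50,
    "صد": 100, "دویست": 200,
}
SCALES = {"هزار": 10**3, "میلیون": 10**6, "میلیارد": 10**9}
CONNECTOR = "و"


def to_persian_digits(n: int) -> str:
    """Render an integer with Persian digits (U+06F0 to U+06F9)."""
    return "".join(chr(0x06F0 + int(ch)) for ch in str(n))


def spoken_to_digits(phrase: str) -> str:
    """Convert a phrase made only of number words into Persian digits."""
    total, current = 0, 0
    for word in phrase.split():
        if word == CONNECTOR:
            continue
        if word in SCALES:
            # A bare scale word ("هزار") counts as one of that scale.
            total += (current or 1) * SCALES[word]
            current = 0
        elif word in VALUES:
            current += VALUES[word]
        else:
            raise ValueError(f"not a known number word: {word!r}")
    return to_persian_digits(total + current)


print(spoken_to_digits("هشت"))
```

A production ITN pass also has to find number spans inside running text and cover the full vocabulary, which this sketch leaves out.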
Shenava-1 family (all native-Rust streaming)
- Koochik 114M — flagship
- Rizeh 32M — mid
- Rizeh-Pizeh 6.9M — tiniest
Quantized variants — int4 / int8 (NEW)
Our streaming support is merged into tract main (sonos/tract#2441), which also ships int4 (MatMulNBits -> Q4_0) and int8 GEMM kernels. So tract main runs quantized versions of this streaming model:
| file | precision | size | notes |
|---|---|---|---|
| model.onnx | fp32 | 33MB | reference |
| model.int4.onnx | int4 (MatMulNBits / Q4_0, weight-only) | 14MB | ⭐ recommended: 2.4x smaller, ~fp32 speed, byte-identical decode |
| model.int8.onnx | int8 (matmul-only, MatMulInteger) | 17MB | byte-identical; slower on small-batch streaming (per-matmul DynamicQuantizeLinear); best for large-batch / offline, or CPUs where it wins |
Both quants decode byte-identically to fp32. For edge/on-device streaming, use model.int4.onnx (weight-only, no per-matmul activation quant). Needs tract main — the streaming fixes are upstream now, so the bundled .patch is no longer required.
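The size reductions follow directly from the file sizes in the table. The short sketch below recomputes them. The sizes are copied from the table, and the helper name `shrink_factor` is illustrative.

```python
# File sizes in MB, as listed in the table above.
SIZES_MB = {"model.onnx": 33, "model.int4.onnx": 14, "model.int8.onnx": 17}


def shrink_factor(name: str, reference: str = "model.onnx") -> float:
    """How many times smaller a variant is than the fp32 reference."""
    return SIZES_MB[reference] / SIZES_MB[name]


for name in ("model.int4.onnx", "model.int8.onnx"):
    print(f"{name}: {shrink_factor(name):.1f}x smaller than fp32")
```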
Model tree for Reza2kn/Shenava-Rizeh-Pizeh-v1.0-tract-streaming
Base model: nvidia/stt_fa_fastconformer_hybrid_large