Shenava-Rizeh 0.9: Persian Cache-Aware Streaming ASR (32M)

A tiny (32M-param) cache-aware, multi-latency streaming Persian (Farsi) ASR model: FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for fully offline, on-device, real-time captioning (WebGPU / WASM / NeMo), part of the VisualEars project (SLT 2026).

One model serves the entire latency–accuracy curve (0 / 80 / 480 / 1040 ms): pick your operating point at inference time, with no re-training. Its larger sibling is Shenava-Koochik-0.9 (114M).

📊 Results: Golden6669 (held-out gold Persian eval)

Evaluated on Reza2kn/visualears-golden-6669 (6,669 clips, official Persian normalizer), RNNT head:

| att_context_size | Lookahead latency | WER | CER | WER_bf |
|---|---|---|---|---|
| [70, 0] | 0 ms (real-time) | 11.08% | 3.17% | 9.68% |
| [70, 1] | 80 ms | 10.85% | 3.08% | 9.43% |
| [70, 6] | 480 ms | 10.56% | 2.93% | 9.14% |
| [70, 13] | 1040 ms | 10.46% | 2.89% | 9.03% |

The curve is nearly flat: WER improves by only 0.62 pp from 0 → 1040 ms. You get near-best accuracy at zero lookahead: a 32M model doing 11.08% WER / 3.17% CER at true real-time, fully on-device. For reference, the previous-generation single-latency fa32M scored 17.40% (and could not run below its trained latency).

Flatness holds per-condition

The latency penalty stays small in every acoustic condition (0.47 to 0.76 pp WER), so low latency does not fray on the hard far-field/obstructed tail. Golden6669 is deliberately ~94% non-clean:

| Condition | n (clips) | WER at 0 ms (%) | WER at 1040 ms (%) | Δ (pp) |
|---|---|---|---|---|
| clean | 400 | 13.38 | 12.62 | +0.76 |
| obstructed | 4,335 | 10.87 | 10.20 | +0.67 |
| far-field | 1,934 | 11.24 | 10.77 | +0.47 |

Far-field has the smallest 0 → 1040 ms gap. (Clean scores worst here, a quirk of Golden6669's small 400-clip clean slice, not a streaming effect.)
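The per-condition figures can be rechecked with a few lines of arithmetic. The sketch below only re-derives the Δ column and the non-clean share from the numbers reported above; the variable names are illustrative and nothing here is new measurement.

```python
# Per-condition results copied from the table above:
# (number of clips, WER % at 0 ms, WER % at 1040 ms).
per_condition = {
    "clean": (400, 13.38, 12.62),
    "obstructed": (4335, 10.87, 10.20),
    "far-field": (1934, 11.24, 10.77),
}

total_clips = sum(n for n, _, _ in per_condition.values())

# Latency penalty in percentage points: WER at 0 ms minus WER at 1040 ms.
penalty_pp = {
    name: round(wer_0 - wer_1040, 2)
    for name, (_, wer_0, wer_1040) in per_condition.items()
}

# Share of the evaluation set that is not in the clean slice.
non_clean_share = 1 - per_condition["clean"][0] / total_clips

for name, (n, _, _) in per_condition.items():
    share = n / total_clips
    print(f"{name:<11} n={n:>5} ({share:5.1%})  penalty={penalty_pp[name]:+.2f} pp")
print(f"total clips: {total_clips}, non-clean share: {non_clean_share:.1%}")
```

The slice sizes sum to the 6,669 clips of the evaluation set, which is a quick consistency check on the table.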

WER_bf = boundary-forgiven WER (utterances perfect modulo Persian word-spacing conventions counted correct).
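One way to read this definition is sketched below. It is an assumption-laden illustration, not the official scorer: it treats an utterance as "perfect modulo word spacing" when reference and hypothesis are identical once spaces and zero-width non-joiners (U+200C) are removed, and otherwise counts ordinary word-level edit errors. The helper names and the transliterated example strings are hypothetical.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1]


def strip_boundaries(text):
    """Remove ZWNJ and all whitespace so only the letter sequence remains."""
    return "".join(text.replace(ZWNJ, "").split())


def wer(pairs, boundary_forgiven=False):
    """Corpus WER over (reference, hypothesis) pairs.

    With boundary_forgiven=True, an utterance whose only differences are
    word-boundary choices contributes zero errors (assumed reading of WER_bf).
    """
    errors = words = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()
        words += len(ref_words)
        if boundary_forgiven and strip_boundaries(ref) == strip_boundaries(hyp):
            continue
        errors += word_errors(ref_words, hyp_words)
    return errors / words if words else 0.0


# Transliterated toy examples (hypothetical, for readability).
pairs = [
    ("mi\u200cravam khaneh", "mi ravam khaneh"),  # spacing-only difference
    ("salam donya", "salam donya"),               # exact match
    ("ketab khoob ast", "ketab bad ast"),         # one real substitution
]
print(f"WER    = {wer(pairs):.4f}")
print(f"WER_bf = {wer(pairs, boundary_forgiven=True):.4f}")
```

In this toy set the spacing-only utterance costs two word errors under plain WER and none under the boundary-forgiven variant, which is the kind of gap the WER_bf column is meant to expose.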

How it was trained

  • Base: nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms (English cache-aware streaming), persianized by swapping in a Persian BPE-1024 tokenizer and reinitializing the decoder + joint (encoder kept).
  • Multi-latency: att_context_size = [[70,13],[70,6],[70,1],[70,0]] (chunked_limited), so one checkpoint covers 0 / 80 / 480 / 1040 ms.
  • Phase A: ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian (Reza2kn/visualears-persian-pseudo-asr).
  • Phase B: gold-anchor fine-tune on 355 human-verified gold + active-learning corrections.
  • Trajectory: random-init decoder → Phase A 17.94% → Phase B 10.46% (@1040 ms).
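To make the multi-latency bullet concrete: when NeMo's cache-aware encoder is configured with a list of contexts, training draws one context per step, so a single set of weights sees every lookahead. The sketch below imitates that draw in plain Python. Uniform sampling is an assumption (the source does not say how the four contexts were weighted), and `sample_context` is a hypothetical helper, not a NeMo function.

```python
import random
from collections import Counter

# Contexts from the training recipe above, as [left, right] in encoder frames.
ATT_CONTEXT_SIZES = [[70, 13], [70, 6], [70, 1], [70, 0]]


def sample_context(rng, probs=None):
    """Pick one [left, right] context for a training step.

    probs=None means uniform sampling (an assumption for this sketch).
    """
    return rng.choices(ATT_CONTEXT_SIZES, weights=probs, k=1)[0]


# Simulate 10,000 training steps with a fixed seed and count the draws.
rng = random.Random(0)
counts = Counter(tuple(sample_context(rng)) for _ in range(10_000))

for ctx, n in sorted(counts.items(), reverse=True):
    print(f"att_context_size={list(ctx)} drawn {n} times")
```

Each context ends up in roughly a quarter of the steps, which is why no operating point needs its own checkpoint.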

Usage (NeMo)

from nemo.collections.asr.models import ASRModel

m = ASRModel.restore_from("shenava-rizeh-0.9.nemo").cuda().eval()  # needs a CUDA GPU; drop .cuda() to run on CPU
m.encoder.set_default_att_context_size([70, 0])  # 0 ms (real-time); use [70, 13] for best WER
print(m.transcribe(["clip.wav"])[0].text)

[70,0] = 0 ms · [70,1] = 80 ms · [70,6] = 480 ms · [70,13] = 1040 ms (lookahead = right context × 80 ms; 1 encoder frame = 80 ms, FastConformer subsampling 8).
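The mapping above is just the right-context frame count times 80 ms. A minimal sketch of that arithmetic, plus a loop that transcribes the same clips at every operating point, follows. `lookahead_ms` and `transcribe_at_each_latency` are hypothetical helpers; the only model calls used are the two from the usage snippet, and `model` is assumed to be the object loaded there.

```python
FRAME_MS = 80  # one encoder frame, per the note above

# Operating points listed above, as [left, right] context in encoder frames.
OPERATING_POINTS = [[70, 0], [70, 1], [70, 6], [70, 13]]


def lookahead_ms(att_context_size, frame_ms=FRAME_MS):
    """Lookahead latency: right-context frames times the frame duration."""
    _left, right = att_context_size
    return right * frame_ms


def transcribe_at_each_latency(model, paths, contexts=OPERATING_POINTS):
    """Return {lookahead_ms: [text, ...]} for the same clips at each context.

    `model` is assumed to expose encoder.set_default_att_context_size() and
    transcribe(), as in the usage snippet above.
    """
    results = {}
    for ctx in contexts:
        model.encoder.set_default_att_context_size(ctx)
        hypotheses = model.transcribe(paths)
        results[lookahead_ms(ctx)] = [h.text for h in hypotheses]
    return results


print([lookahead_ms(c) for c in OPERATING_POINTS])  # [0, 80, 480, 1040]
```

With the model from the usage snippet, `transcribe_at_each_latency(m, ["clip.wav"])` would give one transcript per latency, which makes it easy to see on your own audio how little the extra lookahead buys.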

Notes

  • Version 0.9: Phase B on 355 human-verified gold. v1.0 follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
  • The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
  • Larger streaming sibling: Shenava-Koochik-0.9 (114M). Offline flagship: shenava-fa-fastconformer-115m (7.29%).
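Since the two heads serve different roles, it can help to compare them on the same audio. The sketch below assumes the checkpoint loads as a NeMo hybrid RNNT+CTC model, whose `change_decoding_strategy(decoder_type=...)` method selects the active head; `transcribe_with_head` is a hypothetical wrapper, and the head a freshly restored checkpoint uses by default is not stated in the source.

```python
VALID_HEADS = ("rnnt", "ctc")


def transcribe_with_head(model, paths, head):
    """Transcribe `paths` with the chosen decoding head of a hybrid model.

    Assumes `model` is a NeMo hybrid RNNT+CTC model, i.e. it provides
    change_decoding_strategy(decoder_type=...) and transcribe().
    """
    if head not in VALID_HEADS:
        raise ValueError(f"head must be one of {VALID_HEADS}, got {head!r}")
    # Switch the active decoder; the encoder and its context setting are untouched.
    model.change_decoding_strategy(decoder_type=head)
    return [h.text for h in model.transcribe(paths)]
```

For example, calling it once with `"ctc"` and once with `"rnnt"` on the model from the usage section gives a side-by-side view of the deployment head and the higher-accuracy head.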

Citation

@misc{shenava_rizeh_2026,
  title  = {Shenava-Rizeh: Persian Cache-Aware Streaming ASR (32M)},
  author = {Sayar, Reza},
  year   = {2026},
  howpublished = {Hugging Face},
  url    = {https://huggingface.co/Reza2kn/Shenava-Rizeh-0.9}
}