Shenava-Rizeh 0.9 — Persian Cache-Aware Streaming ASR (32M)

A tiny (32M-param) cache-aware, multi-latency streaming Persian (Farsi) ASR model — FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for fully offline, on-device, real-time captioning (WebGPU / WASM / NeMo), part of the VisualEars project (SLT 2026).

One model serves the entire latency–accuracy curve (0 / 80 / 480 / 1040 ms) — pick your operating point at inference time, no re-training. Its larger sibling is Shenava-Koochik-0.9 (114M).

📊 Results — Golden6669 (held-out gold Persian eval)

Evaluated on Reza2kn/visualears-golden-6669 (6,669 clips, official Persian normalizer), RNNT head:

`att_context_size`	Latency	WER	CER	WER_bf
`[70, 0]`	0 ms (real-time)	11.08%	3.17%	9.68%
`[70, 1]`	80 ms	10.85%	3.08%	9.43%
`[70, 6]`	480 ms	10.56%	2.93%	9.14%
`[70, 13]`	1040 ms	10.46%	2.89%	9.03%

The curve is nearly flat — only 0.62 pp WER from 0 → 1040 ms. You get near-best accuracy at zero lookahead: a 32M model doing 11.08% WER / 3.17% CER at true real-time, fully on-device. For reference, the previous-generation single-latency fa32M scored 17.40% (and could not run below its trained latency).

Flatness holds per-condition

The latency penalty is uniform across acoustic conditions — low-latency does not fray on the hard far-field/obstructed tail (Golden6669 is deliberately ~94% non-clean):

Condition	n	0 ms WER	1040 ms WER	Δ
clean	400	13.38	12.62	+0.76
obstructed	4,335	10.87	10.20	+0.67
far-field	1,934	11.24	10.77	+0.47

Far-field has the smallest 0→1040 ms gap. (clean scores worst here — a quirk of Golden6669's small 400-clip clean slice, not a streaming effect.)

WER_bf = boundary-forgiven WER (utterances perfect modulo Persian word-spacing conventions counted correct).

How it was trained

Base: nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms (English cache-aware streaming), persianized by swapping in a Persian BPE-1024 tokenizer and reinitializing the decoder + joint (encoder kept).
Multi-latency: att_context_size = [[70,13],[70,6],[70,1],[70,0]] (chunked_limited) — one checkpoint covers 0 / 80 / 480 / 1040 ms.
Phase A: ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian (Reza2kn/visualears-persian-pseudo-asr).
Phase B: gold-anchor fine-tune on 355 human-verified gold + active-learning corrections.
Trajectory: random-init decoder → Phase A 17.94% → Phase B 10.46% (@1040 ms).

Usage (NeMo)

from nemo.collections.asr.models import ASRModel
m = ASRModel.restore_from("shenava-rizeh-0.9.nemo").cuda().eval()
m.encoder.set_default_att_context_size([70, 0])      # 0 ms (real-time); or [70,13] for best WER
print(m.transcribe(["clip.wav"])[0].text)

[70,0]=0 ms · [70,1]=80 ms · [70,6]=480 ms · [70,13]=1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).

Notes

Version 0.9 — Phase B on 355 human-verified gold. v1.0 follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
Larger streaming sibling: Shenava-Koochik-0.9 (114M). Offline flagship: shenava-fa-fastconformer-115m (7.29%).

Citation

@misc{shenava_rizeh_2026,
  title  = {Shenava-Rizeh: Persian Cache-Aware Streaming ASR (32M)},
  author = {Sayar, Reza},
  year   = {2026},
  howpublished = {Hugging Face},
  url    = {https://huggingface.co/Reza2kn/Shenava-Rizeh-0.9}
}

Downloads last month: -

Model tree for Reza2kn/Shenava-Rizeh-0.9

Base model

nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms

Finetuned

(2)

this model

Quantizations

1 model