Shenava-Koochik 0.9 — Persian Cache-Aware Streaming ASR (114M)

A 114M-param cache-aware, multi-latency streaming Persian (Farsi) ASR model — FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for fully offline, on-device, real-time captioning (WebGPU / WASM / NeMo), part of the VisualEars project (SLT 2026).

One model serves the entire latency–accuracy curve (0 / 80 / 480 / 1040 ms) — pick your operating point at inference time, no re-training. Its tiny sibling is Shenava-Rizeh-0.9 (32M).

📊 Results — Golden6669 (held-out gold Persian eval)

Evaluated on Reza2kn/visualears-golden-6669 (6,669 clips, official Persian normalizer), RNNT head:

`att_context_size`	Latency	WER	CER
`[70, 0]`	0 ms (real-time)	8.57%	2.35%
`[70, 1]`	80 ms	8.29%	2.13%
`[70, 6]`	480 ms	7.71%	1.81%
`[70, 13]`	1040 ms	7.60%	1.77%

Boundary-forgiven WER ranges 7.18% (0 ms) → 6.13% (1040 ms).

The streaming 114M @1040 ms (7.60% / 1.77%) essentially matches the offline flagship shenava-fa-fastconformer-115m (7.29% / 1.63%) within ~0.3 pp — but it streams (one checkpoint, 0 → 1040 ms). At zero lookahead it does 8.57% WER / 2.35% CER at true real-time, fully on-device — sub-9% real-time Persian streaming ASR that didn't exist before. The 0 → 1040 ms span is only 0.97 pp.

How it was trained

Base: nvidia/stt_en_fastconformer_hybrid_large_streaming_multi (English cache-aware multi-latency streaming), persianized by swapping in a Persian BPE-1024 tokenizer and reinitializing the decoder + joint (encoder kept).
Multi-latency: trained with att_context_size = [[70,13],[70,6],[70,1],[70,0]] (chunked_limited) so a single checkpoint covers 0 / 80 / 480 / 1040 ms.
Phase A: ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian (Reza2kn/visualears-persian-pseudo-asr).
Phase B: gold-anchor fine-tune on human-verified gold + active-learning corrections.
Trajectory: random-init decoder (true zero start) → Phase B 7.60% (@1040 ms).

Usage (NeMo)

from nemo.collections.asr.models import ASRModel
m = ASRModel.restore_from("shenava-koochik-0.9.nemo").cuda().eval()
m.encoder.set_default_att_context_size([70, 0])      # 0 ms (real-time); or [70,13] for best WER
print(m.transcribe(["clip.wav"])[0].text)

[70,0]=0 ms · [70,1]=80 ms · [70,6]=480 ms · [70,13]=1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).

Notes

Version 0.9 — Phase B on 300 human-verified gold. The v1.0 release follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
Companion offline flagship: Reza2kn/shenava-fa-fastconformer-115m (7.29% WER). Tiny streaming sibling: Shenava-Rizeh-0.9 (32M).

Citation

@misc{shenava_koochik_2026,
  title  = {Shenava-Koochik: Persian Cache-Aware Streaming ASR (114M)},
  author = {Sayar, Reza},
  year   = {2026},
  howpublished = {Hugging Face},
  url    = {https://huggingface.co/Reza2kn/Shenava-Koochik-0.9}
}

Downloads last month: -

Model tree for Reza2kn/Shenava-Koochik-0.9

Base model

nvidia/stt_en_fastconformer_hybrid_large_streaming_multi

Finetuned

(2)

this model

Quantizations

1 model