Shenava-Koochik 0.9 β€” Persian Cache-Aware Streaming ASR (114M)

A 114M-param cache-aware, multi-latency streaming Persian (Farsi) ASR model β€” FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for fully offline, on-device, real-time captioning (WebGPU / WASM / NeMo), part of the VisualEars project (SLT 2026).

One model serves the entire latency–accuracy curve (0 / 80 / 480 / 1040 ms) β€” pick your operating point at inference time, no re-training. Its tiny sibling is Shenava-Rizeh-0.9 (32M).

πŸ“Š Results β€” Golden6669 (held-out gold Persian eval)

Evaluated on Reza2kn/visualears-golden-6669 (6,669 clips, official Persian normalizer), RNNT head:

att_context_size Latency WER CER
[70, 0] 0 ms (real-time) 8.57% 2.35%
[70, 1] 80 ms 8.29% 2.13%
[70, 6] 480 ms 7.71% 1.81%
[70, 13] 1040 ms 7.60% 1.77%

Boundary-forgiven WER ranges 7.18% (0 ms) β†’ 6.13% (1040 ms).

The streaming 114M @1040 ms (7.60% / 1.77%) essentially matches the offline flagship shenava-fa-fastconformer-115m (7.29% / 1.63%) within ~0.3 pp β€” but it streams (one checkpoint, 0 β†’ 1040 ms). At zero lookahead it does 8.57% WER / 2.35% CER at true real-time, fully on-device β€” sub-9% real-time Persian streaming ASR that didn't exist before. The 0 β†’ 1040 ms span is only 0.97 pp.

How it was trained

  • Base: nvidia/stt_en_fastconformer_hybrid_large_streaming_multi (English cache-aware multi-latency streaming), persianized by swapping in a Persian BPE-1024 tokenizer and reinitializing the decoder + joint (encoder kept).
  • Multi-latency: trained with att_context_size = [[70,13],[70,6],[70,1],[70,0]] (chunked_limited) so a single checkpoint covers 0 / 80 / 480 / 1040 ms.
  • Phase A: ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian (Reza2kn/visualears-persian-pseudo-asr).
  • Phase B: gold-anchor fine-tune on human-verified gold + active-learning corrections.
  • Trajectory: random-init decoder (true zero start) β†’ Phase B 7.60% (@1040 ms).

Usage (NeMo)

from nemo.collections.asr.models import ASRModel
m = ASRModel.restore_from("shenava-koochik-0.9.nemo").cuda().eval()
m.encoder.set_default_att_context_size([70, 0])      # 0 ms (real-time); or [70,13] for best WER
print(m.transcribe(["clip.wav"])[0].text)

[70,0]=0 ms Β· [70,1]=80 ms Β· [70,6]=480 ms Β· [70,13]=1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).

Notes

  • Version 0.9 β€” Phase B on 300 human-verified gold. The v1.0 release follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
  • The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
  • Companion offline flagship: Reza2kn/shenava-fa-fastconformer-115m (7.29% WER). Tiny streaming sibling: Shenava-Rizeh-0.9 (32M).

Citation

@misc{shenava_koochik_2026,
  title  = {Shenava-Koochik: Persian Cache-Aware Streaming ASR (114M)},
  author = {Sayar, Reza},
  year   = {2026},
  howpublished = {Hugging Face},
  url    = {https://huggingface.co/Reza2kn/Shenava-Koochik-0.9}
}
Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Reza2kn/Shenava-Koochik-0.9

Finetuned
(2)
this model
Quantizations
1 model