Shenava-Koochik 0.9: Persian Cache-Aware Streaming ASR (114M)
A 114M-parameter cache-aware, multi-latency streaming Persian (Farsi) ASR model: FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for fully offline, on-device, real-time captioning (WebGPU / WASM / NeMo) as part of the VisualEars project (SLT 2026).
One model serves the entire latency-accuracy curve (0 / 80 / 480 / 1040 ms): pick your operating point at inference time, with no retraining. Its tiny sibling is Shenava-Rizeh-0.9 (32M).
Results: Golden6669 (held-out gold Persian eval)
Evaluated on Reza2kn/visualears-golden-6669 (6,669 clips, official Persian normalizer), RNNT head:
| att_context_size | Latency | WER | CER |
|---|---|---|---|
| [70, 0] | 0 ms (real-time) | 8.57% | 2.35% |
| [70, 1] | 80 ms | 8.29% | 2.13% |
| [70, 6] | 480 ms | 7.71% | 1.81% |
| [70, 13] | 1040 ms | 7.60% | 1.77% |
Boundary-forgiven WER ranges from 7.18% (0 ms) to 6.13% (1040 ms).
The streaming 114M model at 1040 ms (7.60% WER / 1.77% CER) comes within ~0.3 pp WER of the offline flagship shenava-fa-fastconformer-115m (7.29% / 1.63%), but it streams: one checkpoint covers 0 to 1040 ms. At zero lookahead it reaches 8.57% WER / 2.35% CER in true real time, fully on-device: sub-9% real-time Persian streaming ASR that didn't exist before. The WER spread across the 0 to 1040 ms range is only 0.97 pp.
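For readers unfamiliar with the metrics, the sketch below shows the conventional definition of WER and CER as corpus-level edit distance divided by reference length. It is a generic illustration, not the evaluation script used for the table: the official Persian normalizer is not reproduced, counting spaces as characters in CER is an assumption, and the boundary-forgiven variant is not modeled. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits / total reference units."""
    # Assumption: CER counts every character, including spaces.
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    if total == 0:
        raise ValueError("reference text is empty")
    return edits / total


refs = ["a b c"]
hyps = ["a x c"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
```

In practice both references and hypotheses would pass through the same text normalizer before scoring, since the reported numbers depend on it.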
How it was trained
- Base: nvidia/stt_en_fastconformer_hybrid_large_streaming_multi (English cache-aware multi-latency streaming), persianized by swapping in a Persian BPE-1024 tokenizer and reinitializing the decoder + joint (encoder kept).
- Multi-latency: trained with att_context_size = [[70,13],[70,6],[70,1],[70,0]] (chunked_limited) so a single checkpoint covers 0 / 80 / 480 / 1040 ms.
- Phase A: ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian (Reza2kn/visualears-persian-pseudo-asr).
- Phase B: gold-anchor fine-tune on human-verified gold + active-learning corrections.
- Trajectory: random-init decoder (true zero start) → Phase B 7.60% WER (@1040 ms).
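The multi-latency idea can be illustrated with a small sketch: each training step draws one attention context from the list, so the same weights see every lookahead. This is a conceptual illustration in plain Python, not the NeMo training code. The helper name is hypothetical, and uniform sampling is an assumption, since the card does not state how the contexts were weighted.

```python
import random
from collections import Counter

# Context list from the training setup above: [left frames, right (lookahead) frames].
ATT_CONTEXT_SIZES = [[70, 13], [70, 6], [70, 1], [70, 0]]


def sample_att_context(rng, sizes=ATT_CONTEXT_SIZES, probs=None):
    """Pick one [left, right] context for a training step (hypothetical helper)."""
    # Assumption: uniform weights unless explicit probabilities are given.
    weights = probs if probs is not None else [1.0] * len(sizes)
    return rng.choices(sizes, weights=weights, k=1)[0]


rng = random.Random(0)
counts = Counter(tuple(sample_att_context(rng)) for _ in range(1000))
```

Over many steps every context is visited, which is what lets a single checkpoint be switched between operating points at inference time.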
Usage (NeMo)
from nemo.collections.asr.models import ASRModel
m = ASRModel.restore_from("shenava-koochik-0.9.nemo").cuda().eval()
m.encoder.set_default_att_context_size([70, 0]) # 0 ms (real-time); or [70,13] for best WER
print(m.transcribe(["clip.wav"])[0].text)
[70,0] = 0 ms · [70,1] = 80 ms · [70,6] = 480 ms · [70,13] = 1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).
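The mapping above is just the right-context frame count multiplied by the 80 ms encoder frame. A minimal sketch of that arithmetic follows, with a helper that picks an operating point for a latency budget. Both function names are hypothetical and not part of NeMo.

```python
FRAME_MS = 80  # one encoder frame, as stated above (FastConformer subsampling 8)
OPERATING_POINTS = [[70, 0], [70, 1], [70, 6], [70, 13]]


def lookahead_ms(att_context_size, frame_ms=FRAME_MS):
    """Lookahead latency in ms for a [left, right] attention context."""
    _left, right = att_context_size
    return right * frame_ms


def pick_context(max_latency_ms, points=OPERATING_POINTS):
    """Largest-lookahead context that fits the latency budget (hypothetical helper)."""
    eligible = [ctx for ctx in points if lookahead_ms(ctx) <= max_latency_ms]
    if not eligible:
        raise ValueError("no operating point fits this latency budget")
    return max(eligible, key=lookahead_ms)


latencies = {tuple(ctx): lookahead_ms(ctx) for ctx in OPERATING_POINTS}
```

The value returned by `pick_context` could then be passed to `set_default_att_context_size` as in the usage snippet above.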
Notes
- Version 0.9: Phase B on 300 human-verified gold. The v1.0 release follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
- The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
- Companion offline flagship: Reza2kn/shenava-fa-fastconformer-115m (7.29% WER). Tiny streaming sibling: Shenava-Rizeh-0.9 (32M).
Citation
@misc{shenava_koochik_2026,
title = {Shenava-Koochik: Persian Cache-Aware Streaming ASR (114M)},
author = {Sayar, Reza},
year = {2026},
howpublished = {Hugging Face},
url = {https://huggingface.co/Reza2kn/Shenava-Koochik-0.9}
}