nemotron-speech-streaming-en-0.6b (ONNX, INT4)
Streaming English ASR model from NVIDIA NeMo, exported to ONNX and quantised to INT4 for CPU inference. No GPU required.
Originally packaged by Microsoft / GitHub Copilot under MIT. The underlying ASR weights are under the NVIDIA Open Model License, which permits modification and redistribution. The bundled Silero VAD is MIT.
What's in here
| File | Purpose |
|---|---|
| encoder.onnx + encoder.onnx.data | Chunked transformer encoder, 24 layers x 1024 hidden |
| decoder.onnx + decoder.onnx.data | 2-layer LSTM predictor, 640 hidden |
| joint.onnx + joint.onnx.data | RNN-T joint network, vocab 1024 plus blank |
| silero_vad.onnx | Voice activity detector |
| vocab.txt / tokenizer.json | SentencePiece vocabulary (1025 tokens incl. blank) |
| genai_config.json, model_config.json, audio_processor_config.json | Runtime configuration |
Audio pipeline
- 16 kHz mono PCM
- 128-bin log-mel features, n_fft=512, hop=160, window=400, Hann, preemphasis 0.97
- 560 ms chunks (8960 samples), 9-frame mel pre-encode cache
- Encoder produces 7 timesteps per chunk (subsampling factor 8)
- Greedy RNN-T decode with max_symbols_per_step=10
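The chunk figures in the list above follow from the sample rate, hop size, and subsampling factor. A minimal Python sketch of that arithmetic is given here, together with a hypothetical `split_into_chunks` helper (not part of this repository) that cuts a PCM buffer into fixed-size chunks. It assumes one mel frame per hop, ignores the 9-frame pre-encode cache, and zero-pads the final chunk, which is an illustrative choice rather than documented behaviour.

```python
SAMPLE_RATE = 16_000   # Hz, mono PCM
HOP = 160              # samples between consecutive mel frames
CHUNK_MS = 560         # chunk duration in milliseconds
SUBSAMPLING = 8        # encoder subsampling factor

# 560 ms at 16 kHz
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000
# one mel frame per hop (assumption; exact count depends on padding)
MEL_FRAMES_PER_CHUNK = CHUNK_SAMPLES // HOP
# the encoder keeps one timestep per SUBSAMPLING mel frames
ENCODER_STEPS_PER_CHUNK = MEL_FRAMES_PER_CHUNK // SUBSAMPLING
# audio duration covered by a single encoder timestep
MS_PER_ENCODER_STEP = 1000 * HOP * SUBSAMPLING // SAMPLE_RATE


def split_into_chunks(pcm, chunk_samples=CHUNK_SAMPLES):
    """Yield fixed-length chunks of `pcm`, zero-padding the last one.

    Hypothetical helper: `pcm` is any sequence of float samples.
    """
    for start in range(0, len(pcm), chunk_samples):
        chunk = list(pcm[start:start + chunk_samples])
        # Pad the tail so every chunk has the same length.
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk
```

With these constants, each chunk holds 8960 samples, about 56 mel frames, and 7 encoder timesteps, so one encoder timestep spans 80 ms of audio.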
System requirements
- Any modern x86-64 CPU with AVX2 (Windows, Linux, or macOS via ONNX Runtime)
- About 1.5 GB of RAM at runtime
- About 700 MB on disk
Desktop app
A ready-to-use Windows tray app that streams from your microphone into this model and types the result into the focused window: github.com/Garnet-Owl/nemo-voice-typing
Quickstart (Python)
import onnxruntime as ort
encoder = ort.InferenceSession("encoder.onnx")
decoder = ort.InferenceSession("decoder.onnx")
joint = ort.InferenceSession("joint.onnx")
# See genai_config.json for the per-session input shapes and cache layout.
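The three sessions are typically combined in a greedy RNN-T loop. The sketch below shows only the control flow, including the max_symbols_per_step cap, and is not the packaged runtime's implementation. `predict` and `joint` are hypothetical callables that you would write around decoder.onnx and joint.onnx; the actual tensor names and cache layout are not shown here and should be taken from genai_config.json. The blank index of 1024 (last entry of the 1025-token vocabulary) is an assumption based on the usual NeMo convention.

```python
def greedy_rnnt_decode(enc_frames, predict, joint, blank_id=1024,
                       max_symbols_per_step=10, last_token=None, state=None):
    """Greedy RNN-T decode over one chunk of encoder output.

    enc_frames: iterable of encoder timesteps (e.g. 7 per chunk).
    predict(token, state) -> (pred_out, new_state); hypothetical wrapper
        around the LSTM predictor. token=None means start of sequence.
    joint(enc_frame, pred_out) -> sequence of logits over vocab + blank;
        hypothetical wrapper around the joint network.
    blank_id: index of the blank symbol (assumed to be 1024).
    Returns (tokens, last_token, state) so the next chunk can resume.
    """
    tokens = []
    pred_out, state = predict(last_token, state)
    for frame in enc_frames:
        # Emit at most max_symbols_per_step tokens before moving on,
        # which guards against a frame that never predicts blank.
        for _ in range(max_symbols_per_step):
            logits = joint(frame, pred_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: advance to the next encoder timestep
            tokens.append(best)
            last_token = best
            # Advance the predictor only after a non-blank emission.
            pred_out, state = predict(best, state)
    return tokens, last_token, state
```

Passing the returned `last_token` and `state` back in for the next chunk keeps the predictor state continuous across chunk boundaries. To discover the real input names and shapes before writing the wrappers, `session.get_inputs()` and `session.get_outputs()` in ONNX Runtime list them for each loaded session.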