nemotron-speech-streaming-en-0.6b (ONNX, INT4)

Streaming English ASR model from NVIDIA NeMo, exported to ONNX and quantised to INT4 for CPU inference. No GPU required.

Originally packaged by Microsoft / GitHub Copilot under MIT. The underlying ASR weights are under the NVIDIA Open Model License, which permits modification and redistribution. The bundled Silero VAD is MIT.

What's in here

| File | Purpose |
| --- | --- |
| `encoder.onnx` + `encoder.onnx.data` | Chunked transformer encoder, 24 layers x 1024 hidden |
| `decoder.onnx` + `decoder.onnx.data` | 2-layer LSTM predictor, 640 hidden |
| `joint.onnx` + `joint.onnx.data` | RNN-T joint network, vocab 1024 plus blank |
| `silero_vad.onnx` | Voice activity detector |
| `vocab.txt` / `tokenizer.json` | SentencePiece vocabulary (1025 tokens incl. blank) |
| `genai_config.json`, `model_config.json`, `audio_processor_config.json` | Runtime configuration |
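
Before loading anything, it can help to confirm that a download is complete. The following is a minimal sketch, assuming all files from the table sit in one directory and that both `vocab.txt` and `tokenizer.json` are shipped; the helper name `missing_files` is hypothetical and not part of this package.

```python
from pathlib import Path

# File names copied from the table above. Treating every one of them as
# required is an assumption made for this sketch.
EXPECTED_FILES = [
    "encoder.onnx", "encoder.onnx.data",
    "decoder.onnx", "decoder.onnx.data",
    "joint.onnx", "joint.onnx.data",
    "silero_vad.onnx",
    "vocab.txt", "tokenizer.json",
    "genai_config.json", "model_config.json", "audio_processor_config.json",
]


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files(".")` means every file in the table was found in the current directory.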

Audio pipeline

16 kHz mono PCM
128-bin log-mel features, n_fft=512, hop=160, window=400, Hann, preemphasis 0.97
560 ms chunks (8960 samples, i.e. 56 hops of 160 samples), 9-frame mel pre-encode cache
Encoder produces 7 timesteps per chunk (56 mel frames at subsampling factor 8)
Greedy RNN-T decode with max_symbols_per_step=10
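
The last step of the pipeline, greedy RNN-T decoding, can be sketched in a few lines. This is an illustration of the general algorithm, not the code shipped with this model: `predictor_step` and `joint_step` are hypothetical callables standing in for the `decoder.onnx` and `joint.onnx` sessions, and both the blank index and the use of blank as the start symbol are assumptions to check against the configuration files.

```python
def argmax(values):
    """Index of the largest element (works for lists and 1-D arrays)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_rnnt_decode(enc_frames, predictor_step, joint_step, blank_id,
                       init_state, max_symbols_per_step=10):
    """Greedy RNN-T decode over a sequence of encoder frames.

    predictor_step(token, state) -> (pred_out, new_state)   # hypothetical
    joint_step(enc_frame, pred_out) -> scores over vocab + blank  # hypothetical
    """
    tokens = []
    # Prime the predictor with blank as the start symbol (an assumption).
    pred_out, state = predictor_step(blank_id, init_state)
    for frame in enc_frames:
        # Emit at most max_symbols_per_step tokens before moving on,
        # so a misbehaving joint network cannot stall on one frame.
        for _ in range(max_symbols_per_step):
            token = argmax(joint_step(frame, pred_out))
            if token == blank_id:
                break  # blank: advance to the next encoder frame
            tokens.append(token)
            # Only a non-blank token updates the predictor state.
            pred_out, state = predictor_step(token, state)
    return tokens, pred_out, state
```

In a streaming setup the returned `pred_out` and `state` would be carried over to the next chunk instead of re-priming the predictor, so that decoding continues across the 7 encoder timesteps of each chunk.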

System requirements

Any modern x86-64 CPU with AVX2 (Windows, Linux, or macOS via ONNX Runtime)
About 1.5 GB of RAM at runtime
About 700 MB on disk

Desktop app

A ready-to-use Windows tray app that streams from your microphone into this model and types the result into the focused window: github.com/Garnet-Owl/nemo-voice-typing

Quickstart (Python)

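import onnxruntime as ort

# CPU-only inference. Keep each .onnx.data file next to its .onnx file.
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession("encoder.onnx", providers=providers)
decoder = ort.InferenceSession("decoder.onnx", providers=providers)
joint   = ort.InferenceSession("joint.onnx", providers=providers)
# See genai_config.json for the per-session input shapes and cache layout.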
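
The input and output names of each session can also be read from the loaded model itself. The sketch below relies only on the `get_inputs()` and `get_outputs()` methods of an ONNX Runtime `InferenceSession`; the helper name `describe_io` is hypothetical.

```python
def describe_io(session):
    """Return the (name, shape, type) of every input and output of a session."""
    def rows(nodes):
        return [(node.name, node.shape, node.type) for node in nodes]

    return {
        "inputs": rows(session.get_inputs()),
        "outputs": rows(session.get_outputs()),
    }

# Example use with the sessions from the quickstart:
#   for name, sess in [("encoder", encoder), ("decoder", decoder), ("joint", joint)]:
#       print(name, describe_io(sess))
```

Comparing this output with `genai_config.json` is a quick way to see which tensors are audio features and which are cache or state tensors that have to be passed from one chunk to the next.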
