nemotron-speech-streaming-en-0.6b (ONNX, INT4)

Streaming English ASR model from NVIDIA NeMo, exported to ONNX and quantised to INT4 for CPU inference. No GPU required.

Originally packaged by Microsoft / GitHub Copilot under the MIT License. The underlying ASR weights are under the NVIDIA Open Model License, which permits modification and redistribution. The bundled Silero VAD is MIT-licensed.

What's in here

| File | Purpose |
|---|---|
| encoder.onnx + encoder.onnx.data | Chunked transformer encoder, 24 layers x 1024 hidden |
| decoder.onnx + decoder.onnx.data | 2-layer LSTM predictor, 640 hidden |
| joint.onnx + joint.onnx.data | RNN-T joint network, vocab 1024 plus blank |
| silero_vad.onnx | Voice activity detector |
| vocab.txt / tokenizer.json | SentencePiece vocabulary (1025 tokens incl. blank) |
| genai_config.json, model_config.json, audio_processor_config.json | Runtime configuration |
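
Because each ONNX graph depends on a companion .onnx.data file, a missing file tends to surface as a confusing load error. A minimal sketch of a pre-flight check follows; the helper name `missing_files` is hypothetical, and the list assumes that every file named in the listing above (including both vocab.txt and tokenizer.json) is shipped in one directory.

```python
from pathlib import Path

# File names copied from the file listing; adjust if your copy differs.
EXPECTED_FILES = [
    "encoder.onnx", "encoder.onnx.data",
    "decoder.onnx", "decoder.onnx.data",
    "joint.onnx", "joint.onnx.data",
    "silero_vad.onnx",
    "vocab.txt", "tokenizer.json",
    "genai_config.json", "model_config.json", "audio_processor_config.json",
]


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files(".")` from the model directory should return an empty list before any session is created.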

Audio pipeline

  • 16 kHz mono PCM
  • 128-bin log-mel features, n_fft=512, hop=160, window=400, Hann, preemphasis 0.97
  • 560 ms chunks (8960 samples), 9-frame mel pre-encode cache
  • Encoder produces 7 timesteps per chunk (subsampling factor 8)
  • Greedy RNN-T decode with max_symbols_per_step=10
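
The chunk figures above follow from one another, which the sketch below spells out: 560 ms at 16 kHz is 8960 samples, a hop of 160 gives 56 mel frames, and subsampling by 8 leaves 7 encoder timesteps. The `iter_chunks` helper is hypothetical, zero-padding the final partial chunk is an assumption rather than documented behaviour, and the frame count ignores any edge padding and the 9-frame pre-encode cache.

```python
SAMPLE_RATE = 16_000   # Hz, mono PCM
HOP = 160              # samples between consecutive mel frames
CHUNK_MS = 560         # chunk duration in milliseconds
SUBSAMPLING = 8        # encoder subsampling factor

CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000     # 8960 samples
FRAMES_PER_CHUNK = CHUNK_SAMPLES // HOP            # 56 mel frames
STEPS_PER_CHUNK = FRAMES_PER_CHUNK // SUBSAMPLING  # 7 encoder timesteps


def iter_chunks(samples):
    """Yield CHUNK_SAMPLES-long lists, zero-padding the last one (assumption)."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk.extend([0] * (CHUNK_SAMPLES - len(chunk)))
        yield chunk
```

Each yielded chunk would then go through feature extraction before reaching the encoder; that step is not shown here.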

System requirements

  • Any modern x86-64 CPU with AVX2 (Windows, Linux, or macOS via ONNX Runtime)
  • About 1.5 GB of RAM at runtime
  • About 700 MB on disk

Desktop app

A ready-to-use Windows tray app that streams from your microphone into this model and types the result into the focused window: github.com/Garnet-Owl/nemo-voice-typing

Quickstart (Python)

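```python
import onnxruntime as ort

# Each *.onnx.data file must sit next to its .onnx file so ONNX Runtime can find it.
providers = ["CPUExecutionProvider"]
encoder = ort.InferenceSession("encoder.onnx", providers=providers)
decoder = ort.InferenceSession("decoder.onnx", providers=providers)
joint   = ort.InferenceSession("joint.onnx", providers=providers)
# See genai_config.json for the per-session input shapes and cache layout.
```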
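
How the three sessions fit together is not spelled out here, so the following is only a sketch of a generic greedy RNN-T loop with a per-timestep symbol cap, not this package's actual runtime. `predict` and `joint` are hypothetical callables that would wrap decoder.onnx and joint.onnx; their signatures, the use of blank as the priming token, and the `carry` tuple for passing predictor state between chunks are all assumptions. The blank index is left as a parameter because the file listing only says the vocabulary is 1024 tokens plus blank.

```python
def greedy_decode_chunk(frames, predict, joint, blank_id,
                        carry=None, max_symbols_per_step=10):
    """Greedy RNN-T decode over one chunk of encoder timesteps.

    frames:  sequence of per-timestep encoder outputs
    predict: (token, state) -> (pred_out, new_state)   [hypothetical wrapper]
    joint:   (frame, pred_out) -> sequence of logits   [hypothetical wrapper]
    carry:   (pred_out, state) returned by the previous chunk, or None
    """
    if carry is None:
        # Prime the predictor with blank as the start token (assumption).
        carry = predict(blank_id, None)
    pred_out, state = carry
    tokens = []
    for frame in frames:
        # Emit at most max_symbols_per_step tokens before moving on.
        for _ in range(max_symbols_per_step):
            logits = joint(frame, pred_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: advance to the next encoder timestep
            tokens.append(best)
            pred_out, state = predict(best, state)
    return tokens, (pred_out, state)
```

Returning the carry lets a streaming caller feed it back in with the next chunk, so the predictor state survives chunk boundaries.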