Nemotron 3.5 ASR Multi-Encoder (INT4)

INT4-quantized ONNX export of nvidia/nemotron-3.5-asr-streaming-0.6b with five encoders of different chunk sizes, so latency can be traded against accuracy at runtime.

Available Encoders

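| Model               | Chunk size | att_context_size | window_size (mel frames) |
|---------------------|------------|------------------|--------------------------|
| encoder_80ms.onnx   | 80 ms      | [70, 0]          | 17                       |
| encoder_160ms.onnx  | 160 ms     | [70, 1]          | 25                       |
| encoder_320ms.onnx  | 320 ms     | [70, 3]          | 41                       |
| encoder_560ms.onnx  | 560 ms     | [70, 6]          | 65                       |
| encoder_1120ms.onnx | 1120 ms    | [70, 13]         | 121                      |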
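The table's columns follow a regular pattern. The sketch below reproduces them under assumptions inferred from the numbers, not taken from upstream documentation: a 10 ms mel hop with 8x encoder subsampling (80 ms per encoder frame), a chunk of right + 1 encoder frames for att_context_size = [70, right], and window_size equal to the chunk length in mel frames plus 9 extra frames. `encoder_config`, `mel_windows`, and the window overlap in `mel_windows` are hypothetical.

```python
# Hypothetical helpers reproducing the encoder table.
# Assumptions (inferred from the table, not from upstream docs):
#   * mel features use a 10 ms hop and the encoder subsamples by 8,
#     so one encoder output frame covers 80 ms;
#   * att_context_size = [left, right] and a chunk spans right + 1 encoder frames;
#   * window_size adds a fixed 9 mel frames of extra context to the chunk.
MEL_HOP_MS = 10
SUBSAMPLING = 8
EXTRA_MEL_FRAMES = 9
LEFT_CONTEXT = 70


def encoder_config(right_context: int) -> dict:
    """Return file name, chunk size (ms), att_context_size and window_size."""
    chunk_mel_frames = SUBSAMPLING * (right_context + 1)
    chunk_ms = chunk_mel_frames * MEL_HOP_MS
    return {
        "file": f"encoder_{chunk_ms}ms.onnx",
        "chunk_ms": chunk_ms,
        "att_context_size": [LEFT_CONTEXT, right_context],
        "window_size": chunk_mel_frames + EXTRA_MEL_FRAMES,
    }


def mel_windows(num_frames: int, right_context: int):
    """Yield (start, end) mel-frame ranges of window_size frames.

    Assumption: each step advances by one chunk, so consecutive windows
    overlap by EXTRA_MEL_FRAMES frames. A trailing partial window is not
    yielded; real streaming code would pad it.
    """
    cfg = encoder_config(right_context)
    shift = cfg["window_size"] - EXTRA_MEL_FRAMES
    start = 0
    while start + cfg["window_size"] <= num_frames:
        yield start, start + cfg["window_size"]
        start += shift
```

For example, `encoder_config(6)` gives the 560 ms row: att_context_size [70, 6] and a 65-frame window. If the assumptions hold, that encoder would read a 65-frame window and advance 56 frames per step.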

The decoder (decoder.onnx) and joint network (joint.onnx) are shared across all encoders.
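Because only the encoder differs between latency settings, a client can keep the decoder and joint sessions fixed and swap the encoder alone. The sketch below is minimal and assumes that all files sit in one directory and that onnxruntime is installed in a release that implements the MatMulNBits contrib operator. `model_files` and `load_sessions` are hypothetical helpers. The graphs' input and output names are not documented here, so the sketch stops at session creation.

```python
from pathlib import Path

ENCODER_CHUNKS_MS = (80, 160, 320, 560, 1120)


def model_files(model_dir: str, chunk_ms: int = 560) -> dict:
    """Return paths for one encoder plus the shared decoder and joint network."""
    if chunk_ms not in ENCODER_CHUNKS_MS:
        raise ValueError(f"no encoder for {chunk_ms} ms; choose from {ENCODER_CHUNKS_MS}")
    root = Path(model_dir)  # assumption: flat directory containing all .onnx files
    return {
        "encoder": root / f"encoder_{chunk_ms}ms.onnx",
        "decoder": root / "decoder.onnx",  # shared by all encoders
        "joint": root / "joint.onnx",      # shared by all encoders
    }


def load_sessions(model_dir: str, chunk_ms: int = 560) -> dict:
    """Create one onnxruntime.InferenceSession per file (requires onnxruntime)."""
    import onnxruntime as ort  # imported lazily so model_files() works without it

    return {
        name: ort.InferenceSession(str(path), providers=["CPUExecutionProvider"])
        for name, path in model_files(model_dir, chunk_ms).items()
    }
```

To change latency at runtime, only the "encoder" session needs to be rebuilt.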

Choose the encoder that fits your latency budget:

  • 80 ms — ultra-low latency, ideal for interactive voice agents
  • 160 ms — very low latency
  • 320 ms — balanced
  • 560 ms — standard, good accuracy (default)
  • 1120 ms — highest accuracy, higher latency
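One way to apply this list in code is to pick the largest chunk that does not exceed a latency budget, because larger chunks buy accuracy with latency. This sketch assumes chunk size is the only latency cost. Real end-to-end latency also includes feature extraction and compute time. `pick_encoder` is a hypothetical helper.

```python
ENCODERS_MS = (80, 160, 320, 560, 1120)


def pick_encoder(latency_budget_ms: float) -> str:
    """Return the largest-chunk encoder whose chunk size fits the budget.

    Larger chunks give better accuracy, so take the biggest one that still
    fits; fall back to the 80 ms encoder if the budget is below every chunk.
    """
    fitting = [ms for ms in ENCODERS_MS if ms <= latency_budget_ms]
    chosen = max(fitting) if fitting else ENCODERS_MS[0]
    return f"encoder_{chosen}ms.onnx"
```

For example, a 300 ms budget selects encoder_160ms.onnx, and a 600 ms budget selects encoder_560ms.onnx.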

Export Method

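Based on sherpa-onnx's export script. Each encoder is exported with its own att_context_size (see the table above), and the MatMul weights are then quantized to INT4 using ONNX Runtime's MatMulNBits operator with block_size=128.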
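The following is not the exporter's code. In practice, this step is handled by ONNX Runtime's quantization tooling, for example its MatMul4BitsQuantizer. The sketch is an illustrative pure-Python version of the block-wise layout that MatMulNBits stores: 4-bit codes plus one scale and zero point per block of block_size weights along the reduction dimension. The asymmetric min/max scheme and the rounding used here are assumptions. ONNX Runtime's quantizer may use symmetric quantization or compute scales differently.

```python
BLOCK_SIZE = 128  # block_size=128, as stated above
QMAX = 15         # unsigned 4-bit codes span 0..15


def quantize_block(block):
    """Asymmetric 4-bit quantization of one block -> (codes, scale, zero_point)."""
    lo = min(min(block), 0.0)  # include 0 so it stays exactly representable
    hi = max(max(block), 0.0)
    scale = (hi - lo) / QMAX if hi > lo else 1.0
    zero_point = min(QMAX, max(0, round(-lo / scale)))
    codes = [min(QMAX, max(0, round(w / scale) + zero_point)) for w in block]
    return codes, scale, zero_point


def quantize_row(weights, block_size=BLOCK_SIZE):
    """Split one weight row along K into blocks and quantize each independently."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]


def dequantize_row(blocks):
    """Reconstruct approximate weights: w ~= (code - zero_point) * scale."""
    out = []
    for codes, scale, zero_point in blocks:
        out.extend((c - zero_point) * scale for c in codes)
    return out
```

Because each block has its own scale, an outlier only degrades precision within its own 128 weights instead of across the whole row.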

Language Support

Supports 40 language locales via language-ID prompt conditioning.

