Nemotron Speech Streaming 0.6B — LiteRT

Cache-aware FastConformer + RNN-T for sub-second streaming ASR. 80 ms chunks, in-pod state.

Part of the soniqo.audio speech toolkit — an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

Cache-aware streaming ASR exported to LiteRT for sub-second real-time transcription. Three split graphs — encoder, decoder, joint — wired in a per-session loop that holds the FastConformer KV cache and the RNN-T LSTM state across 80 ms chunks. The host owns the loop and the cache state; LiteRT owns the static tensor programs.
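The per-session loop described above can be sketched as a greedy RNN-T decode. This is a minimal illustration under stated assumptions, not the speech-core implementation: `encoder`, `decoder`, and `joint` are hypothetical callables standing in for the three LiteRT graphs, `joint` is assumed to return an already-argmaxed token id, and the blank id and per-frame symbol cap are placeholders rather than values read from the bundle.

```python
from dataclasses import dataclass, field
from typing import Any, List

BLANK_ID = 0       # assumption: the real blank id comes from the bundle
MAX_SYMBOLS = 10   # assumption: safety cap on tokens emitted per encoded frame


@dataclass
class Session:
    """State the host carries across chunks for one audio stream."""
    enc_cache: Any   # encoder cache, opaque to this loop
    dec_state: Any   # LSTM state after the last emitted token
    dec_out: Any     # prediction-network output for the last emitted token
    tokens: List[int] = field(default_factory=list)


def step_chunk(sess, mel_chunk, encoder, decoder, joint):
    """Feed one mel chunk through the three graphs; return the tokens it emitted."""
    # The encoder consumes the chunk plus the cache and hands back an updated cache.
    frames, sess.enc_cache = encoder(mel_chunk, sess.enc_cache)
    emitted = []
    for frame in frames:
        for _ in range(MAX_SYMBOLS):
            token = joint(frame, sess.dec_out)
            if token == BLANK_ID:
                break  # blank: move on to the next encoded frame
            emitted.append(token)
            # Non-blank: advance the prediction network and keep its new state.
            sess.dec_out, sess.dec_state = decoder(token, sess.dec_state)
    sess.tokens.extend(emitted)
    return emitted
```

The decoder only advances on non-blank tokens, which is why its state has to live in the session rather than inside any single graph invocation.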

Files

| File | Description |
| --- | --- |
| `nemotron-streaming-encoder.tflite` | INT8 cache-aware FastConformer encoder |
| `nemotron-streaming-decoder.tflite` | FP32 RNN-T prediction network (LSTM) |
| `nemotron-streaming-joint.tflite` | FP32 joint network |
| `vocab.json` | SentencePiece BPE vocab |
| `config.json` | Mel + chunk + cache shape spec |
| `nemotron-streaming-encoder_recipe.json` | Quantizer recipe |

Streaming contract

audio chunk (80 ms, 16 kHz)
    │
    ▼
mel fbank (80 bins) ──► encoder + cached K/V ──► encoded frame
                                                         │
                                                         ▼
                                               decoder (LSTM)
                                                         │
                                                         ▼
                                                  joint ──► BPE token
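
The first stage of the diagram implies a fixed chunk size: 80 ms at 16 kHz is 1280 samples. A minimal sketch of the chunking step follows. The helper name `iter_chunks` and the choice to zero-pad the final short chunk are assumptions made for illustration; the mel window and hop parameters are not covered here and would come from config.json.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 80
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 1280 samples per chunk


def iter_chunks(samples, chunk=CHUNK_SAMPLES):
    """Yield fixed-size chunks of audio samples.

    Assumption: the last, shorter chunk is zero-padded so that every
    chunk has the static length a fixed-shape graph would expect.
    """
    for start in range(0, len(samples), chunk):
        piece = list(samples[start:start + chunk])
        if len(piece) < chunk:
            piece.extend([0.0] * (chunk - len(piece)))
        yield piece
```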

The C++ worker carries the encoder cache and the LSTM state from one chunk to the next; LiteRT only runs the static tensor programs. Cache shapes are published in config.json so the worker can pre-allocate and reset state per session without inspecting the bundle.
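As a rough sketch of pre-allocating and resetting state from published shapes: the `cache_shapes` key, the tensor names, the dimensions, and the float32 element type below are all hypothetical, since the actual config.json schema is not reproduced in this document.

```python
import json
from array import array
from math import prod

# Hypothetical stand-in for the cache section of config.json.
EXAMPLE_CONFIG = """
{
  "cache_shapes": {
    "encoder_cache": [2, 1, 4, 8],
    "lstm_h": [2, 1, 16],
    "lstm_c": [2, 1, 16]
  }
}
"""


def allocate_state(config):
    """Allocate one zeroed float32 buffer per tensor listed in the config."""
    return {
        name: array("f", [0.0]) * prod(shape)
        for name, shape in config["cache_shapes"].items()
    }


def reset_state(state):
    """Zero every buffer in place so the next session starts clean."""
    for buf in state.values():
        buf[:] = array("f", [0.0]) * len(buf)


state = allocate_state(json.loads(EXAMPLE_CONFIG))
```

Resetting in place keeps the same buffers alive across sessions, so a worker never has to reallocate when a stream ends and a new one begins.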

Validation

End-to-end verified against a 12.56 s reference utterance:

First-partial latency 0.42 s
p50 chunk compute 79.9 ms on CCX23 CPU (RTF ≈ 1.0× per session)
Transcript matches the upstream PyTorch reference to within boundary-artifact noise. On this utterance, the LiteRT path's 'The quick brown fox jumps over the lazy dog' came out cleaner than the Python validator's output.
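The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers reported above (80 ms chunks, 79.9 ms p50 compute, 12.56 s utterance); the function names are illustrative.

```python
CHUNK_MS = 80
P50_COMPUTE_MS = 79.9
UTTERANCE_MS = 12_560


def real_time_factor(compute_ms, audio_ms):
    """Compute time divided by audio duration; below 1.0 keeps up with real time."""
    return compute_ms / audio_ms


def chunk_count(duration_ms, chunk_ms):
    """Number of chunks needed to cover the audio (integer ceiling division)."""
    return (duration_ms + chunk_ms - 1) // chunk_ms


rtf = real_time_factor(P50_COMPUTE_MS, CHUNK_MS)   # just under 1.0
n_chunks = chunk_count(UTTERANCE_MS, CHUNK_MS)     # 157 chunks
headroom_ms = CHUNK_MS - P50_COMPUTE_MS            # about 0.1 ms per chunk at p50
```

At the median, each chunk finishes about 0.1 ms before the next one arrives, so one session on this CPU runs right at the real-time boundary.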

Source

Exported from nvidia/nemotron-speech-streaming-en-0.6b. On macOS the conversion runs as a two-stage pipeline (trace → LiteRT conversion, in separate processes) because NeMo and litert_torch contend for native thread pools when they share one interpreter.
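One way to get this kind of process isolation is to launch each stage in a fresh interpreter. The sketch below illustrates the pattern only and is not the project's actual export script; the script names in the usage comment are hypothetical.

```python
import subprocess
import sys


def run_stages(stages):
    """Run each stage in its own fresh Python interpreter, in order.

    Each entry is the argument list passed to the interpreter. A non-zero
    exit raises subprocess.CalledProcessError and stops the pipeline, so a
    later stage never runs on a missing or partial artifact.
    """
    for argv in stages:
        subprocess.run([sys.executable, *argv], check=True)


# Hypothetical usage: stage 1 traces the model and writes an artifact to
# disk, stage 2 loads that artifact and converts it.
# run_stages([
#     ["trace_model.py", "--out", "traced.pt2"],
#     ["convert_to_litert.py", "--in", "traced.pt2"],
# ])
```

Because the stages only communicate through a file on disk, the two libraries never load their native thread pools into the same process.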

Ecosystem

soniqo.audio — use-case explorer (transcription, voice cloning, live ASR, voice agents).
speech-core — C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
speech-swift — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
speech-android — Android SDK consuming on-device LiteRT bundles.