Nemotron Speech Streaming 0.6B - LiteRT

Cache-aware FastConformer + RNN-T for sub-second streaming ASR. 80 ms chunks, in-pod state.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (a C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

Cache-aware streaming ASR exported to LiteRT for sub-second real-time transcription. Three split graphs (encoder, decoder, joint) are wired in a per-session loop that holds the FastConformer KV cache and the RNN-T LSTM state across 80 ms chunks. The host owns the loop and the cache state; LiteRT owns the static tensor programs.
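The per-session loop can be sketched roughly as follows. This is a minimal illustration, not the speech-core implementation: `encode`, `predict`, and `joint` are hypothetical callables standing in for the three LiteRT graphs (with the mel front end folded into `encode`), and greedy decoding, the `max_symbols` cap, and zero-padding of a trailing partial chunk are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterator, List, Sequence

SAMPLE_RATE = 16_000
CHUNK_MS = 80
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 1280 samples per 80 ms chunk


def split_chunks(samples: Sequence[float]) -> Iterator[List[float]]:
    """Yield 80 ms chunks; a trailing partial chunk is zero-padded (assumption)."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))
        yield chunk


@dataclass
class Session:
    """State the host carries between chunks; contents are opaque in this sketch."""
    encoder_cache: Any
    decoder_state: Any
    last_token: int  # last emitted token, fed back to the prediction network
    tokens: List[int] = field(default_factory=list)


def step(session: Session, chunk: Sequence[float],
         encode: Callable, predict: Callable, joint: Callable,
         blank_id: int, max_symbols: int = 10) -> List[int]:
    """Process one chunk with greedy RNN-T decoding and update the session."""
    # The encoder consumes the old cache and hands back the updated one.
    frames, session.encoder_cache = encode(chunk, session.encoder_cache)
    for frame in frames:
        for _ in range(max_symbols):  # cap on symbols emitted per frame
            pred, new_state = predict(session.last_token, session.decoder_state)
            token = joint(frame, pred)
            if token == blank_id:
                break  # blank: advance to the next frame, discard new_state
            # Non-blank: commit the token and the LSTM state that produced it.
            session.tokens.append(token)
            session.last_token = token
            session.decoder_state = new_state
    return session.tokens
```

A caller would create one `Session` per stream, feed it every chunk from `split_chunks`, and map `session.tokens` through vocab.json to get text.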

Files

File                                    Description
nemotron-streaming-encoder.tflite       INT8 cache-aware FastConformer encoder
nemotron-streaming-decoder.tflite       FP32 RNN-T prediction network (LSTM)
nemotron-streaming-joint.tflite         FP32 joint network
vocab.json                              SentencePiece BPE vocab
config.json                             Mel + chunk + cache shape spec
nemotron-streaming-encoder_recipe.json  Quantizer recipe

Streaming contract

audio chunk (80 ms, 16 kHz)
    │
    ▼
mel fbank (80 bins) ──► encoder + cached K/V ──► encoded frame
                                                       │
                                                       ▼
                                                decoder (LSTM)
                                                       │
                                                       ▼
                                                     joint ──► BPE token

The C++ worker owns the cache and LSTM state across chunks; LiteRT owns the static tensor programs. Cache shapes are published in config.json so the worker can pre-allocate and reset state per session without inspecting the bundle.
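As an illustration of pre-allocating and resetting state from published shapes, here is a small Python sketch. The key name `state_shapes`, the tensor names, and the shapes themselves are hypothetical; the actual schema of config.json is not shown in this card, so treat this as a pattern rather than a parser for the real file.

```python
import json
from array import array
from math import prod
from typing import Dict, List

# Hypothetical config fragment; real key names and shapes may differ.
EXAMPLE_CONFIG = json.loads("""
{
  "state_shapes": {
    "encoder_cache": [2, 1, 4, 8],
    "lstm_h": [2, 1, 8],
    "lstm_c": [2, 1, 8]
  }
}
""")


def allocate_state(shapes: Dict[str, List[int]]) -> Dict[str, array]:
    """Allocate one flat, zero-filled float32 buffer per named state tensor."""
    return {name: array("f", [0.0]) * prod(shape) for name, shape in shapes.items()}


def reset_state(state: Dict[str, array]) -> None:
    """Zero every buffer in place so the allocation can be reused by a new session."""
    for buf in state.values():
        buf[:] = array("f", [0.0]) * len(buf)


state = allocate_state(EXAMPLE_CONFIG["state_shapes"])
```

Allocating once per worker and resetting in place between sessions avoids reallocating on every new stream, which is presumably the reason the shapes are published separately from the graphs.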

Validation

End-to-end verified against a 12.56 s reference utterance:

  • First-partial latency 0.42 s
  • p50 chunk compute 79.9 ms on CCX23 CPU (RTF ≈ 1.0× per session)
  • Transcript matches the upstream PyTorch reference to within boundary-artifact noise. On this utterance, the LiteRT path's 'The quick brown fox jumps over the lazy dog' came out cleaner than the Python validator's output.
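The real-time factor quoted above follows directly from the chunk size and the median compute time. The short calculation below only restates the reported figures; it is not an additional measurement.

```python
CHUNK_MS = 80.0          # audio covered by one chunk
P50_COMPUTE_MS = 79.9    # reported median compute time per chunk
UTTERANCE_MS = 12_560    # reported reference utterance length (12.56 s)

# RTF = compute time / audio duration; values below 1.0 keep up with real time.
rtf = P50_COMPUTE_MS / CHUNK_MS
headroom_ms = CHUNK_MS - P50_COMPUTE_MS
num_chunks = UTTERANCE_MS // int(CHUNK_MS)

print(f"RTF = {rtf:.3f}, headroom = {headroom_ms:.1f} ms per chunk, {num_chunks} chunks")
```

Because 79.9 ms is a median, roughly half of the chunks take longer than that, so on this CPU a single session sits right at the real-time boundary with essentially no headroom.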

Source

Exported from nvidia/nemotron-speech-streaming-en-0.6b. On macOS the conversion runs as a two-stage pipeline (trace + LiteRT conversion in separate processes) because NeMo and litert_torch fight over native thread pools when they share one interpreter.
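One way to keep two conflicting libraries out of the same interpreter is to run each stage as its own OS process and pass the intermediate artifact through disk. The sketch below shows that pattern only; the script names, flags, and artifact file name are hypothetical and do not describe the actual export scripts.

```python
import subprocess
import sys
from typing import List, Sequence


def run_stages(stages: Sequence[List[str]]) -> List[int]:
    """Run each stage as its own process, in order; raise on the first failure."""
    codes = []
    for cmd in stages:
        # check=True raises CalledProcessError on a non-zero exit, so a later
        # stage never runs against a missing or partial artifact.
        result = subprocess.run(cmd, check=True)
        codes.append(result.returncode)
    return codes


# Hypothetical script and file names, for illustration only.
EXAMPLE_STAGES = [
    [sys.executable, "trace_stage.py", "--out", "traced_model.bin"],   # would import NeMo only
    [sys.executable, "convert_stage.py", "--in", "traced_model.bin"],  # would import litert_torch only
]
```

Calling `run_stages(EXAMPLE_STAGES)` would then give each library a fresh interpreter with its own native thread pools.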

Ecosystem

  • soniqo.audio: use-case explorer (transcription, voice cloning, live ASR, voice agents).
  • speech-core: C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
  • speech-swift: Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
  • speech-android: Android SDK consuming on-device LiteRT bundles.

Other LiteRT models in this collection

ASR / Transcription

VAD / Diarization

TTS / Voice Cloning

License

This bundle inherits the upstream model license (cc-by-4.0). See the linked base_model repository for the full terms.
