Nemotron-3.5-ASR-Streaming-Multilingual-0.6B: LiteRT (INT8)

Cache-aware streaming multilingual speech recognition. A 0.6 B FastConformer-RNNT encoder with a 128-slot language prompt, exported to LiteRT (.tflite) with channelwise dynamic INT8 encoder weights. This is the smallest Android build (~687 MB). For best quality across all languages, use the FP16 build.

  • Architecture: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
  • Streaming: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
  • Languages: 100+ via the prompt dictionary (languages.json)
  • Audio: 16 kHz mono, 128-bin log-mel front end
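
The streaming figures above translate into buffer sizes with simple arithmetic. The sketch below is illustrative only: the sample rate, chunk, lookahead and subsampling factor come from this card, while the 10 ms mel hop is an assumption (the card does not state it; config.json is the place to check).

```python
SAMPLE_RATE_HZ = 16_000  # from this card
CHUNK_MS = 320           # from this card
LOOKAHEAD_MS = 240       # from this card
SUBSAMPLING = 8          # from this card
HOP_MS = 10              # ASSUMPTION: mel hop length is not stated in the card


def ms_to_samples(ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a whole number of PCM samples."""
    return ms * sample_rate_hz // 1000


def ms_to_encoder_frames(ms: int, hop_ms: int = HOP_MS,
                         subsampling: int = SUBSAMPLING) -> int:
    """Number of encoder output frames that cover `ms` of audio."""
    return ms // (hop_ms * subsampling)


chunk_samples = ms_to_samples(CHUNK_MS)                # 5120 samples
lookahead_samples = ms_to_samples(LOOKAHEAD_MS)        # 3840 samples
chunk_frames = ms_to_encoder_frames(CHUNK_MS)          # 4 frames (assumed hop)
lookahead_frames = ms_to_encoder_frames(LOOKAHEAD_MS)  # 3 frames (assumed hop)
```

Under the assumed 10 ms hop, one encoder frame spans 80 ms, so the 240 ms lookahead corresponds to 3 encoder frames, which agrees with the right context of 3 listed above.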

Model

Parameters: ~0.6 B
Format: LiteRT / TFLite (3-graph: encoder + decoder + joint)
Precision: INT8 (channelwise dynamic, encoder) + FP32 decoder/joint
Bundle size: ~687 MB
Sample rate: 16 kHz mono
Chunk / lookahead: 320 ms / 240 ms
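
To make "channelwise" concrete, here is a minimal sketch of symmetric per-channel INT8 weight quantization in plain Python. It illustrates the general idea only and is not the ai_edge_quantizer implementation used for this export; the function names and the toy weight matrix are hypothetical. In dynamic quantization generally, activations are quantized at run time, which this sketch does not cover.

```python
from typing import List, Tuple


def quantize_per_channel(weights: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets a dummy scale so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Map INT8 values back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Hypothetical toy weights: one small-magnitude and one large-magnitude channel.
weights = [
    [0.02, -0.01, 0.005],
    [1.5, -3.0, 0.75],
]
q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
```

Because each row has its own scale, the small-magnitude channel keeps the full INT8 range instead of being rounded to near zero by a scale chosen for the large channel.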

Files

| File | Size | Description |
| --- | --- | --- |
| nemotron-multilingual-encoder.tflite | ~594 MB | Cache-aware FastConformer encoder (INT8 weights) |
| nemotron-multilingual-decoder.tflite | ~60 MB | RNN-T prediction network (FP32) |
| nemotron-multilingual-joint.tflite | ~38 MB | RNN-T joint network (FP32) |
| io_map.json | ~4 KB | 22-port I/O wiring (inputs, outputs, carried caches) |
| config.json | <1 KB | Model + streaming config (mel, chunk, cache sizes) |
| languages.json | ~2 KB | Locale → prompt-slot dictionary (128 slots) |
| vocab.json | ~230 KB | 13,087-token BPE vocabulary |
| *_recipe.json | <1 KB | ai_edge_quantizer INT8 recipe |
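
As a rough illustration of how the locale → prompt-slot dictionary might be consumed, the sketch below loads a mapping and validates slot indices against the 128 slots mentioned above. The file layout is an assumption (a flat JSON object such as {"en-US": 0}); the real languages.json may be structured differently, and both helper names are hypothetical.

```python
import json
from pathlib import Path

NUM_PROMPT_SLOTS = 128  # from this card


def load_prompt_slots(path):
    """Load a locale -> slot mapping and validate the slot indices.

    ASSUMPTION: the file is a flat JSON object such as {"en-US": 0, "fr-FR": 1}.
    The real languages.json layout may differ.
    """
    mapping = json.loads(Path(path).read_text(encoding="utf-8"))
    for locale, slot in mapping.items():
        if not isinstance(slot, int) or not 0 <= slot < NUM_PROMPT_SLOTS:
            raise ValueError(
                f"slot {slot!r} for {locale!r} is outside 0..{NUM_PROMPT_SLOTS - 1}"
            )
    return mapping


def slot_for(mapping, locale):
    """Return the prompt slot for a locale, failing loudly when it is unknown."""
    try:
        return mapping[locale]
    except KeyError:
        raise KeyError(f"locale {locale!r} is not in the prompt dictionary") from None
```

Failing on an unknown locale, rather than silently falling back to a default slot, makes a wrong language prompt visible early.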

Performance & runtime requirement

Runtime note. The channelwise-INT8 FULLY_CONNECTED ops require an Android NNAPI or XNNPACK delegate. The plain desktop LiteRT CPU interpreter fails at allocate_tensors() on the INT8 encoder (fully_connected.cc:215 … failed to prepare). This build is therefore intended for on-device Android use with a delegate, and its quality is validated on-device rather than on desktop CPU.

Reference: the equivalent ONNX INT8 build (per-channel), measured on FLEURS with a 320 ms chunk and n=30. INT8 is near-lossless for Arabic, Hindi and Japanese, but costs about +5.6 WER on English and about +2 on French versus FP16. Use this build when on-device size matters most; otherwise prefer FP16.
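
For readers unfamiliar with the metric, the following is a minimal sketch of the standard word error rate definition: word-level edit distance divided by the number of reference words. It is not the scoring script behind the numbers above; the card does not say which tool or text normalization was used, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution (sat -> sit) and one deletion (the).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Here two errors against six reference words give a WER of 2/6, or about 33%.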

Usage

# On Android, load through a delegate (NNAPI / XNNPACK); this is required for the INT8 encoder.
from ai_edge_litert.interpreter import Interpreter, load_delegate

enc = Interpreter(
    model_path="nemotron-multilingual-encoder.tflite",
    experimental_delegates=[load_delegate("libnnapi_delegate.so")],  # or XNNPACK
)
enc.allocate_tensors()
# io_map.json describes the 22 ports: audio/mel input, language-prompt slot,
# carried encoder caches (attention / conv / pre-cache), and emitted features.

Production streaming, delegate selection, cache management and RNN-T greedy decoding are handled by the speech-android SDK.
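
For orientation, here is a minimal sketch of RNN-T greedy decoding in plain Python. It is not the speech-android SDK's implementation. The `predict` and `joint` callables are hypothetical wrappers around the decoder and joint graphs, and the blank id, the use of blank as the start symbol, and the per-frame symbol cap are all assumptions to check against config.json and vocab.json.

```python
from typing import Any, Callable, List, Sequence, Tuple


def greedy_rnnt_decode(
    encoder_frames: Sequence[Sequence[float]],
    predict: Callable[[int, Any], Tuple[Sequence[float], Any]],
    joint: Callable[[Sequence[float], Sequence[float]], Sequence[float]],
    blank_id: int,
    initial_state: Any = None,
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy RNN-T search: emit the argmax token until blank, then advance a frame.

    predict(token, state) -> (prediction_output, new_state)   # hypothetical wrapper
    joint(encoder_frame, prediction_output) -> logits         # hypothetical wrapper
    """
    tokens: List[int] = []
    # ASSUMPTION: the prediction network is primed with the blank token.
    pred_out, state = predict(blank_id, initial_state)
    for frame in encoder_frames:
        # The cap guards against a model that never emits blank on a frame.
        for _ in range(max_symbols_per_frame):
            logits = joint(frame, pred_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(best)
            # Only non-blank tokens update the prediction network.
            pred_out, state = predict(best, state)
    return tokens
```

In a streaming setting, the prediction output and state would be carried from one chunk to the next rather than reset, alongside the encoder caches described in io_map.json.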

Source

Converted from nvidia/nemotron-3.5-asr-streaming-0.6b (NVIDIA NeMo) via ai-edge-torch. Licensed under the NVIDIA Open Model License.

Related models

| Variant | Repo |
| --- | --- |
| ONNX · FP16 | soniqo/…-ONNX-FP16 |
| ONNX · INT8 | soniqo/…-ONNX-INT8 |
| LiteRT · FP16 | soniqo/…-LiteRT-FP16 |
| LiteRT · INT8 (this) | soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-INT8 |
