Nemotron-3.5-ASR-Streaming-Multilingual-0.6B — LiteRT (INT8)

Cache-aware streaming multilingual speech recognition: a 0.6 B FastConformer-RNNT encoder with a 128-slot language prompt, exported to LiteRT (.tflite) with channelwise dynamic INT8 encoder weights. This is the smallest Android build (~687 MB). For best quality across all languages, use the FP16 build.

Architecture: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
Streaming: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
Languages: 100+ via the prompt dictionary (languages.json)
Audio: 16 kHz mono, 128-bin log-mel front end
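
The streaming figures above are mutually consistent if the mel front end uses a 10 ms hop. The card does not state the hop, so treat it as an assumption, along with reading the left and right context values as counts of encoder frames. A minimal sketch of the arithmetic:

```python
# Assumption (not stated in the card): the log-mel front end uses a 10 ms hop.
MEL_HOP_MS = 10

# Values taken from the card.
SAMPLE_RATE_HZ = 16_000
SUBSAMPLING = 8
CHUNK_MS = 320
LOOKAHEAD_MS = 240
LEFT_CONTEXT_FRAMES = 56   # assumed to be counted in encoder frames

# One encoder frame covers SUBSAMPLING mel frames.
encoder_frame_ms = MEL_HOP_MS * SUBSAMPLING

# Per-chunk sizes at each stage of the pipeline.
samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MS // 1000
mel_frames_per_chunk = CHUNK_MS // MEL_HOP_MS
encoder_frames_per_chunk = CHUNK_MS // encoder_frame_ms

# Lookahead and left context expressed in the other unit.
lookahead_encoder_frames = LOOKAHEAD_MS // encoder_frame_ms
left_context_ms = LEFT_CONTEXT_FRAMES * encoder_frame_ms

print(samples_per_chunk, mel_frames_per_chunk, encoder_frames_per_chunk)
print(lookahead_encoder_frames, left_context_ms)
```

Under that assumption the 240 ms lookahead equals three encoder frames, which matches the stated right context of 3, and the left context of 56 frames spans roughly 4.5 s of audio.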

Model


Property	Value
Parameters	~0.6 B
Format	LiteRT / TFLite (3-graph: encoder + decoder + joint)
Precision	INT8 (channelwise dynamic, encoder) + FP32 decoder/joint
Bundle size	~687 MB
Sample rate	16 kHz mono
Chunk / lookahead	320 ms / 240 ms

Files

File	Size	Description
`nemotron-multilingual-encoder.tflite`	~594 MB	Cache-aware FastConformer encoder (INT8 weights)
`nemotron-multilingual-decoder.tflite`	~60 MB	RNN-T prediction network (FP32)
`nemotron-multilingual-joint.tflite`	~38 MB	RNN-T joint network (FP32)
`io_map.json`	~4 KB	22-port I/O wiring (inputs, outputs, carried caches)
`config.json`	<1 KB	Model + streaming config (mel, chunk, cache sizes)
`languages.json`	~2 KB	Locale → prompt-slot dictionary (128 slots)
`vocab.json`	~230 KB	13 087-token BPE vocabulary
`*_recipe.json`	<1 KB	ai_edge_quantizer INT8 recipe
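
As an illustration of how `languages.json` might be used, the sketch below looks up the prompt slot for a locale. The schema is an assumption: the card only says the file maps locales to prompt slots, so a flat `{"locale": slot_index}` object is assumed here, and the helper names `load_languages` and `prompt_slot` are hypothetical.

```python
import json

NUM_PROMPT_SLOTS = 128  # from the card: 128-slot language prompt


def load_languages(path):
    """Load the locale -> slot dictionary (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def prompt_slot(languages, locale):
    """Return the prompt slot for a locale, validating it against the slot count."""
    try:
        slot = languages[locale]
    except KeyError:
        raise KeyError(f"locale {locale!r} is not in the prompt dictionary") from None
    if not isinstance(slot, int) or not 0 <= slot < NUM_PROMPT_SLOTS:
        raise ValueError(f"slot {slot!r} for {locale!r} is outside 0..{NUM_PROMPT_SLOTS - 1}")
    return slot
```

If the real file nests its entries or uses different keys, only `prompt_slot` would need to change. How the slot is fed to the encoder (index, one-hot vector, or otherwise) is defined by `io_map.json`, not by this sketch.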

Performance & runtime requirement

Runtime note. The channelwise-INT8 FULLY_CONNECTED ops require an Android NNAPI / XNNPACK delegate. The plain desktop LiteRT CPU interpreter fails at allocate_tensors() on the INT8 encoder (fully_connected.cc:215 … failed to prepare), so this build is intended for on-device Android with a delegate, and quality is validated on-device rather than on desktop CPU.
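
Because preparation can fail at `allocate_tensors()`, one way to handle it is to try candidate delegates in order and keep the first interpreter that allocates. The sketch below is generic: `first_working_interpreter` is a hypothetical helper, and each factory is any zero-argument callable that returns an object with an `allocate_tensors()` method. It assumes that a failed delegate load or a failed prepare surfaces as `ValueError`, `RuntimeError` or `OSError`.

```python
def first_working_interpreter(factories):
    """Try each (name, factory) pair in order.

    Returns (name, interpreter) for the first interpreter whose
    allocate_tensors() succeeds; raises RuntimeError if none does.
    """
    errors = []
    for name, make_interpreter in factories:
        try:
            interpreter = make_interpreter()
            interpreter.allocate_tensors()
            return name, interpreter
        except (RuntimeError, ValueError, OSError) as exc:
            # Record the failure and move on to the next candidate.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no candidate could prepare the model:\n" + "\n".join(errors))
```

On device, each factory could be a lambda that wraps `Interpreter(model_path=..., experimental_delegates=[load_delegate(...)])` with a different delegate library. The collected error messages make it easier to see which delegate rejected the INT8 ops.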

Reference numbers come from the equivalent ONNX INT8 build (per-channel), measured on FLEURS with 320 ms chunks, n=30. There, INT8 is near-lossless for Arabic, Hindi and Japanese but costs about 5.6 WER on English and about 2 on French relative to FP16. Use this build when on-device size matters most; otherwise prefer FP16.
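
For readers comparing builds themselves, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain implementation of that definition, not the evaluation script behind the numbers above; text normalisation (casing, punctuation, script-specific tokenisation) is left out and would affect real results.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

If the deltas quoted above are absolute percentage points, which the card does not state explicitly, a build that moves English from 10.0 % to 15.6 % WER would show the reported cost.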

Usage

```python
# On Android, load through a delegate (NNAPI / XNNPACK); the INT8 encoder requires one.
from ai_edge_litert.interpreter import Interpreter, load_delegate

enc = Interpreter(
    model_path="nemotron-multilingual-encoder.tflite",
    experimental_delegates=[load_delegate("libnnapi_delegate.so")],  # or XNNPACK
)
enc.allocate_tensors()
# io_map.json describes the 22 ports: audio/mel input, language-prompt slot,
# carried encoder caches (attention / conv / pre-cache), and emitted features.
```

Production streaming, delegate selection, cache management and RNN-T greedy decoding are handled by the speech-android SDK.
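
For orientation, standard RNN-T greedy decoding can be sketched as below. This is the textbook loop, not the SDK's implementation. `predict` and `joint` are hypothetical callables standing in for the decoder and joint graphs, the blank index is passed in because the card does not state it, and `max_symbols_per_frame` is an illustrative safety cap.

```python
def rnnt_greedy_decode(encoder_frames, predict, joint, blank_id,
                       carry=None, max_symbols_per_frame=10):
    """Greedy RNN-T decoding over the encoder frames of one chunk.

    predict(token, state) -> (decoder_output, new_state)
    joint(encoder_frame, decoder_output) -> sequence of logits (vocab + blank)
    carry is the (decoder_output, state) pair returned for the previous chunk.
    """
    if carry is None:
        # Start of stream: prime the prediction network with the blank token.
        carry = predict(blank_id, None)
    dec_out, state = carry

    tokens = []
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            logits = joint(frame, dec_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: move on to the next encoder frame
            tokens.append(best)
            # Non-blank: advance the prediction network, stay on this frame.
            dec_out, state = predict(best, state)
    return tokens, (dec_out, state)
```

Passing the returned carry into the next call keeps the prediction-network state continuous across chunks, in the same way that the encoder caches are carried between encoder invocations. Mapping token ids back to text uses `vocab.json`.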

Source

Converted from nvidia/nemotron-3.5-asr-streaming-0.6b (NVIDIA NeMo) via ai-edge-torch. Licensed under the NVIDIA Open Model License.

Related models

Variant	Repo
ONNX · FP16	soniqo/…-ONNX-FP16
ONNX · INT8	soniqo/…-ONNX-INT8
LiteRT · FP16	soniqo/…-LiteRT-FP16
LiteRT · INT8 (this)	`soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-INT8`
