Nemotron-3.5-ASR-Streaming-Multilingual-0.6B: LiteRT (INT8)
Cache-aware streaming multilingual speech recognition. A 0.6 B FastConformer-RNNT
model with a 128-slot language prompt, exported to LiteRT (.tflite) with
channelwise dynamic INT8 encoder weights. This is the smallest Android build (~687 MB). For best
quality across all languages, use the FP16 build.
- Architecture: cache-aware FastConformer encoder (24 layers, 1024 hidden, 8× subsampling) + RNN-T decoder/joint
- Streaming: 320 ms chunk, 240 ms lookahead, left attention context 56, right context 3
- Languages: 100+ via the prompt dictionary (languages.json)
- Audio: 16 kHz mono, 128-bin log-mel front end
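
The streaming figures above translate into sample and frame counts with simple arithmetic. The sketch below is illustrative only: the sample rate, chunk, lookahead and subsampling values come from this card, while the 10 ms mel hop (`HOP_MS`) and the helper names are assumptions, so check config.json for the real front-end settings.

```python
SAMPLE_RATE_HZ = 16_000  # from this card
CHUNK_MS = 320           # from this card
LOOKAHEAD_MS = 240       # from this card
SUBSAMPLING = 8          # from this card
HOP_MS = 10              # ASSUMPTION: mel hop is not stated in this card


def ms_to_samples(ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a number of audio samples."""
    return sample_rate_hz * ms // 1000


def ms_to_encoder_frames(ms, hop_ms=HOP_MS, subsampling=SUBSAMPLING):
    """Convert a duration to encoder frames (one frame per hop * subsampling)."""
    return ms // (hop_ms * subsampling)


def iter_chunks(samples, chunk_samples):
    """Yield consecutive fixed-size chunks, zero-padding the final one."""
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        yield chunk


chunk_samples = ms_to_samples(CHUNK_MS)                # 5120 samples
lookahead_samples = ms_to_samples(LOOKAHEAD_MS)        # 3840 samples
chunk_frames = ms_to_encoder_frames(CHUNK_MS)          # 4 encoder frames
lookahead_frames = ms_to_encoder_frames(LOOKAHEAD_MS)  # 3 encoder frames
```

Under the assumed 10 ms hop, the 240 ms lookahead works out to 3 encoder frames, which would be consistent with the right context of 3 listed above.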
Model
| Property | Value |
|---|---|
| Parameters | ~0.6 B |
| Format | LiteRT / TFLite (3-graph: encoder + decoder + joint) |
| Precision | INT8 (channelwise dynamic, encoder) + FP32 decoder/joint |
| Bundle size | ~687 MB |
| Sample rate | 16 kHz mono |
| Chunk / lookahead | 320 ms / 240 ms |
Files
| File | Size | Description |
|---|---|---|
| nemotron-multilingual-encoder.tflite | ~594 MB | Cache-aware FastConformer encoder (INT8 weights) |
| nemotron-multilingual-decoder.tflite | ~60 MB | RNN-T prediction network (FP32) |
| nemotron-multilingual-joint.tflite | ~38 MB | RNN-T joint network (FP32) |
| io_map.json | ~4 KB | 22-port I/O wiring (inputs, outputs, carried caches) |
| config.json | <1 KB | Model + streaming config (mel, chunk, cache sizes) |
| languages.json | ~2 KB | Locale → prompt-slot dictionary (128 slots) |
| vocab.json | ~230 KB | 13,087-token BPE vocabulary |
| *_recipe.json | <1 KB | ai_edge_quantizer INT8 recipe |
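
Selecting a language means resolving a locale to one of the 128 prompt slots. The sketch below shows one way that lookup might be done. The schema of languages.json is not documented in this card, so the flat `{"locale": slot_index}` layout, the region-to-language fallback, and the function names are all assumptions; inspect the file before relying on any of them.

```python
import json

NUM_PROMPT_SLOTS = 128  # from this card


def load_language_slots(path):
    """Load the locale dictionary. ASSUMES a flat {"locale": slot_index} object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def resolve_slot(locale, slots, num_slots=NUM_PROMPT_SLOTS):
    """Return the prompt slot for a locale, falling back from 'en-US' to 'en'."""
    for key in (locale, locale.split("-")[0]):
        if key in slots:
            slot = int(slots[key])
            if not 0 <= slot < num_slots:
                raise ValueError(f"slot {slot} for {key!r} is outside 0..{num_slots - 1}")
            return slot
    raise KeyError(f"no prompt slot for locale {locale!r}")


# Hypothetical entries for illustration only; real values live in languages.json.
example_slots = {"en": 0, "fr-FR": 5}
english_slot = resolve_slot("en-US", example_slots)  # falls back to "en"
french_slot = resolve_slot("fr-FR", example_slots)   # exact match
```

How the resolved slot is fed to the encoder (index, one-hot vector, or something else) is defined by io_map.json, not by this sketch.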
Performance & runtime requirement
Runtime note. The channelwise-INT8 FULLY_CONNECTED ops require an Android NNAPI / XNNPACK delegate; the plain desktop LiteRT CPU interpreter cannot allocate_tensors() on the INT8 encoder (fully_connected.cc:215 … failed to prepare). This build is intended for on-device Android with a delegate. Quality is therefore validated on-device, not on desktop CPU.
Reference: figures from the equivalent ONNX INT8 build (per-channel) on FLEURS, 320 ms chunks, n=30. INT8 is near-lossless for Arabic / Hindi / Japanese but costs about +5.6 WER on English and about +2 on French versus FP16. Use this build when on-device size matters most; otherwise prefer FP16.
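
For readers who want to reproduce a WER comparison on their own recordings, the following is a minimal sketch of the standard word-level edit-distance definition of WER. It is not the evaluation script behind the figures above; text normalization, tokenization for languages without whitespace, and the function name are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Scoring the same audio with the FP16 and INT8 builds and subtracting the two results gives a delta comparable in kind to the ones quoted above.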
Usage
# On Android, load through a delegate (NNAPI / XNNPACK); this is required for the INT8 encoder.
from ai_edge_litert.interpreter import Interpreter, load_delegate
enc = Interpreter(
    model_path="nemotron-multilingual-encoder.tflite",
    experimental_delegates=[load_delegate("libnnapi_delegate.so")],  # or XNNPACK
)
enc.allocate_tensors()
# io_map.json describes the 22 ports: audio/mel input, language-prompt slot,
# carried encoder caches (attention / conv / pre-cache), and emitted features.
Production streaming, delegate selection, cache management and RNN-T greedy decoding are handled by the speech-android SDK.
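
To make the decoding step concrete, here is a minimal sketch of textbook greedy RNN-T decoding over one chunk of encoder frames. It is not the speech-android SDK's implementation. The `predict` and `joint` callables are hypothetical wrappers around the decoder and joint graphs, and the blank index, the start-of-sequence convention (`predict(None, None)`) and `max_symbols_per_frame` are assumptions.

```python
def greedy_rnnt_decode(encoder_frames, predict, joint, blank_id,
                       carry=None, max_symbols_per_frame=10):
    """Greedy RNN-T decoding for one chunk.

    predict(token, state) -> (dec_out, new_state); token None means start.
    joint(enc_frame, dec_out) -> sequence of logits over vocab + blank.
    carry is the (dec_out, state) pair returned by the previous chunk.
    """
    if carry is None:
        carry = predict(None, None)
    dec_out, state = carry
    tokens = []
    for enc_frame in encoder_frames:
        # Emit symbols on this frame until blank wins or the cap is reached.
        for _ in range(max_symbols_per_frame):
            logits = joint(enc_frame, dec_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break
            tokens.append(best)
            # The prediction network advances only on non-blank symbols.
            dec_out, state = predict(best, state)
    return tokens, (dec_out, state)


# Toy stand-ins so the sketch runs without the .tflite graphs.
BLANK = 2


def toy_predict(token, state):
    if token is None:
        return -1, 0
    return token, state + 1


def toy_joint(enc_frame, dec_out):
    logits = [0.0, 0.0, 0.0]
    logits[BLANK if enc_frame == dec_out else enc_frame] = 1.0
    return logits


first, carry = greedy_rnnt_decode([0, 1], toy_predict, toy_joint, BLANK)
second, carry = greedy_rnnt_decode([1, 2, 0], toy_predict, toy_joint, BLANK, carry=carry)
```

Returning the `(dec_out, state)` pair lets the next chunk resume where the previous one stopped, which is what keeps chunked output identical to decoding the whole utterance at once in this toy setup.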
Source
Converted from nvidia/nemotron-3.5-asr-streaming-0.6b (NVIDIA NeMo) via ai-edge-torch. Licensed under the NVIDIA Open Model License.
Related models
| Variant | Repo |
|---|---|
| ONNX · FP16 | soniqo/…-ONNX-FP16 |
| ONNX · INT8 | soniqo/…-ONNX-INT8 |
| LiteRT · FP16 | soniqo/…-LiteRT-FP16 |
| LiteRT · INT8 (this) | soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-INT8 |
| LiteRT Β· INT8 (this) | soniqo/Nemotron-3.5-ASR-Streaming-Multilingual-0.6B-LiteRT-INT8 |
Links
- speech-android: Android SDK
- speech-core: on-device inference core (C++)
- soniqo.audio: website
- blog