Nemotron 3.5 ASR Streaming Multilingual 0.6B — CoreML

To request access, join the Discord server at https://discord.gg/S6m4ET3pX and message Sisyphu.

CoreML / Apple Neural Engine builds (called "ships" below) of nemotron-3.5-asr-streaming-0.6b (Conformer encoder + RNN-T decoder), optimized for on-device streaming ASR on Apple Silicon. Benchmarked on Apple M5 Pro / macOS 26.5.

Built on the 2026-05-29 base-checkpoint update.

Two models × 4 latency tiers = 8 bundles.

latin/ — one Latin-script-pruned vocab (2828 tokens) shared by en / es / fr / it / pt / de (smaller, faster joint).
multilingual/ — the full 13087-token vocab covering every language, including zh / ja (and 100+ more via prompt_id).

Each model comes in four chunk sizes (0.56 s / 1 s / 2 s / 4 s), trading latency for throughput. Pick the folder by script and pass the exact language at inference (--language de-DE). FluidAudio's downloader auto-routes the language to the right folder. Per-language results are in the table below and in manifest.json.

Ship matrix (per-file RTFx, single-stream batch=1)

RTFx = real-time factor (seconds of audio processed per second of wall-clock time; higher is faster). Each cell shows RTFx followed by the error rate in parentheses: WER for Latin-script languages, CER (marked C) for zh/ja, which have no word boundaries. ⭐ marks the recommended default tier. All numbers are from the FLEURS test sets, full splits (see methodology). The Folder column shows which bundle serves that language: the en/es/fr/it/pt/de rows are all the same latin/ model measured per language, and the zh/ja and Multilingual rows are the same multilingual/ model.
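As a rough illustration of the metric, the sketch below computes an aggregate RTFx from per-file timings. It assumes that "sum-aggregate" means total audio duration divided by total processing time; the function name and the sample timings are hypothetical and are not taken from the benchmark.

```python
def aggregate_rtfx(files):
    """Return total audio seconds divided by total wall-clock seconds.

    `files` is an iterable of (audio_seconds, wall_seconds) pairs, one per file.
    """
    files = list(files)  # allow generators, since we iterate twice
    total_audio = sum(audio for audio, _ in files)
    total_wall = sum(wall for _, wall in files)
    if total_wall <= 0:
        raise ValueError("total wall time must be positive")
    return total_audio / total_wall


# Made-up timings for three files: 60 s of audio processed in 0.5 s overall.
sample = [(10.0, 0.10), (20.0, 0.15), (30.0, 0.25)]
print(f"RTFx = {aggregate_rtfx(sample):.0f}")
```

Summing before dividing weights long files more heavily than averaging per-file ratios would, so the two conventions can give different numbers on the same data.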

Language	Folder	Vocab	0.56 s (560 ms) ‡	1 s (1120 ms)	2 s (2240 ms) ⭐	4 s (4480 ms)	Test set
English	`latin`	2828	58 (9.43%)	103 (8.89%)	130 (8.96%)	122 (9.02%)	FLEURS en_us
Spanish	`latin`	2828	58 (4.95%)	106 (4.76%)	140 (4.80%)	136 (4.77%)	FLEURS es_419
French	`latin`	2828	57 (9.68%)	105 (9.44%)	130 (9.52%)	124 (9.42%)	FLEURS fr_fr
Italian	`latin`	2828	59 (5.68%)	109 (5.45%)	147 (5.41%)	150 (5.40%)	FLEURS it_it
Portuguese	`latin`	2828	59 (6.38%)	108 (6.11%)	141 (6.14%)	141 (6.18%)	FLEURS pt_br
German	`latin`	2828	59 (10.83%)	107 (9.78%)	144 (9.83%)	142 (9.83%)	FLEURS de_de
Chinese	`multilingual`	13087	22 (19.48% C)	27 (18.75% C)	89 (18.57% C)	90 (18.05% C)	FLEURS cmn_hans_cn
Japanese	`multilingual`	13087	21 (14.61% C)	26 (13.77% C)	84 (13.79% C)	89 (13.82% C)	FLEURS ja_jp
Multilingual	`multilingual`	13087	23 (9.15%)	71 (8.64%)	80 (8.76%)	78 (8.78%)	FLEURS en_us

‡ 560 ms is the lowest-latency tier but off the trained attention tiling — lower throughput and a small quality cost vs 1120 ms. Use 1120 ms+ unless sub-second latency is required.

Full-vocab models (zh / ja / multilingual) are tier-sensitive. The 13087-vocab joint matmul only fits the ANE working set efficiently at the 2 s tier and above. At 560 ms the per-chunk joint overhead dominates and throughput collapses to ≈ 21–23 RTFx; use the 2 s tier for zh/ja/multilingual (zh/ja ≈ 84–89, multilingual-en ≈ 80). Throughput at 1 s depends on output density: sparse Latin text (multilingual-en ≈ 71 RTFx) fares far better than dense CJK (zh/ja ≈ 26–27), since CJK hits the big joint on more decode steps. The Latin-script ships (small joint) are fast at every tier.

Which tier to use

2 s (2240 ms) is the recommended default for every model. Latin-script ships run ≈ 130–147 RTFx; zh/ja/multilingual reach ≈ 80–89 RTFx, at or within a few RTFx of their best. WER/CER is at or near its best, at 2.5 s latency.
1 s (1120 ms) for lower latency (1.25 s) on the Latin-script ships at near-full quality (≈ 103–109 RTFx). Avoid for zh/ja (≈ 26–27 RTFx); multilingual on English holds up better (≈ 71 RTFx) but still trails the 2 s tier.
0.56 s (560 ms) only when sub-second latency is mandatory; off the trained tiling, so throughput and quality both dip. Not recommended for zh/ja/multilingual (≈ 21–23 RTFx).
4 s (4480 ms) for offline/long-form. Within noise of 2 s for the Latin-script ships, so 2 s usually dominates.
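The guidance above can be condensed into a small decision helper. This is only a sketch of the recommendations as written; the function and its parameter names are hypothetical and are not part of FluidAudio.

```python
def recommend_tier_ms(model, need_sub_second=False,
                      prefer_low_latency=False, offline=False):
    """Pick a chunk size in milliseconds following the tier guidance.

    model: "latin" or "multilingual" (the two bundle folders).
    """
    if model not in ("latin", "multilingual"):
        raise ValueError(f"unknown model: {model!r}")
    if need_sub_second:
        # Lowest latency, but off the trained tiling; the text advises
        # against this tier for the full-vocab model.
        return 560
    if offline:
        return 4480
    if prefer_low_latency and model == "latin":
        # The 1 s tier is only recommended for the Latin-script ships.
        return 1120
    return 2240  # recommended default for every model


print(recommend_tier_ms("latin", prefer_low_latency=True))
print(recommend_tier_ms("multilingual", prefer_low_latency=True))
```

Note that the helper falls back to 2240 ms for the multilingual model even when lower latency is preferred, mirroring the advice to avoid the 1 s tier there.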

Recipe

All ships share: LAYERPOS [42,13] mixed-precision encoder (first/last 3 Conformer layers INT8, middle 18 layers 6-bit palettized — ~55% encoder size cut vs FP16, WER-neutral) + B1 decoder⊕joint fusion + triple-stage pipelining.
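A back-of-the-envelope check of the encoder size figure, under simplifying assumptions that are mine rather than the authors': all 24 Conformer layers (3 + 18 + 3, as implied by the text) hold the same number of weights, every weight in a layer is quantized, and palette lookup tables, scales, and any non-layer parameters are ignored.

```python
FP16_BITS = 16

# Bits per weight for each Conformer layer, per the recipe:
# first 3 and last 3 layers INT8, middle 18 layers 6-bit palettized.
layer_bits = [8] * 3 + [6] * 18 + [8] * 3

quantized_bits = sum(layer_bits)            # relative size after quantization
fp16_bits = FP16_BITS * len(layer_bits)     # relative size at FP16
size_ratio = quantized_bits / fp16_bits
size_cut = 1.0 - size_ratio

print(f"layers: {len(layer_bits)}, idealized size cut: {size_cut:.1%}")
```

This idealized estimate comes out near 59%, a little above the reported ~55%. A gap in that direction would be consistent with the overheads the sketch ignores, though the source does not break the figure down.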

Vocab handling differs by script:

Latin-script languages (en/es/fr/it/pt/de) share one Latin-script-pruned joint — the keep-set is derived from the writing system (all Latin + shared punctuation/digit tokens kept; CJK/Hangul/Cyrillic/Arabic/etc. dropped), not from any test corpus. 2828 tokens, ~5× smaller joint, no test-set overfit and no in-script OOV. One model file serves all six languages.
Chinese / Japanese / multilingual keep the full 13087-vocab joint — no pruning, no OOV, full character coverage.
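To make the idea of a script-derived keep-set concrete, here is a minimal sketch that keeps a token when every alphabetic character in it is Latin, so punctuation, digits, and tokenizer markers pass through. It uses Unicode character names as a stand-in for script detection; this is an illustrative heuristic, not the procedure actually used to build the 2828-token vocab, and the toy vocabulary is invented.

```python
import unicodedata


def is_latin_script_token(token: str) -> bool:
    """True if the token contains no alphabetic character outside Latin."""
    for ch in token:
        # Non-letters (digits, punctuation, the SentencePiece-style marker)
        # are script-neutral and never cause a token to be dropped.
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


def prune_vocab(vocab):
    """Return the tokens that survive the Latin-script filter, in order."""
    return [tok for tok in vocab if is_latin_script_token(tok)]


toy_vocab = ["▁the", "ción", "ß", "42", ",", "中", "の", "한", "д", "ع"]
kept = prune_vocab(toy_vocab)
print(kept)
```

Because the rule looks only at the writing system, it never consults an evaluation corpus, which is the property the text relies on. For scale, 13087 / 2828 ≈ 4.6, in line with the ~5× figure quoted for the joint.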

The encoder is shared across all languages (a multilingual encoder that selects language via prompt_id) and is byte-identical across the Latin-script and full-vocab ships at each tier — only the decode stack differs.

Usage (FluidAudio)

Each <model>/<tier>ms/ directory is a self-contained bundle. Pick the folder by script (latin for en/es/fr/it/pt/de, multilingual for everything else) and pass the exact language:

fluidaudiocli nemotron-multilingual-transcribe \
    --input audio.wav \
    --model-dir latin/2240ms \
    --language de-DE

The FluidAudio auto-downloader routes --language to the correct folder automatically. Models are shipped as compiled .mlmodelc (immediate load on Apple Silicon).
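For callers that build the --model-dir path themselves, the folder rule can be sketched as below. This is a hypothetical helper that follows the rule as stated in this card; it is not FluidAudio's downloader logic, and the locale parsing (taking the part before the first hyphen or underscore) is an assumption.

```python
LATIN_LANGS = {"en", "es", "fr", "it", "pt", "de"}
TIERS_MS = (560, 1120, 2240, 4480)


def bundle_dir(language: str, tier_ms: int = 2240) -> str:
    """Map a language tag such as "de-DE" to "<model>/<tier>ms"."""
    if tier_ms not in TIERS_MS:
        raise ValueError(f"unsupported tier: {tier_ms}")
    # Reduce "de-DE" or "de_DE" to the primary language subtag "de".
    primary = language.replace("_", "-").split("-")[0].lower()
    model = "latin" if primary in LATIN_LANGS else "multilingual"
    return f"{model}/{tier_ms}ms"


print(bundle_dir("de-DE"))
print(bundle_dir("ja-JP"))
```

The default of 2240 ms reflects the recommended tier; any language outside the six Latin-script ones falls through to multilingual.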

Folder layout

<model>/<tier>ms/
  preprocessor.mlmodelc
  encoder.mlmodelc          # LAYERPOS [42,13], byte-identical across both models per tier
  decoder.mlmodelc
  joint.mlmodelc
  decoder_joint.mlmodelc    # B1 fusion (default decode path)
  metadata.json
  tokenizer.json

<model> ∈ {latin, multilingual}; <tier> ∈ {560, 1120, 2240, 4480}. latin serves en/es/fr/it/pt/de (shared Latin-script vocab); multilingual serves zh/ja and 100+ languages via prompt_id (full vocab). A top-level manifest.json indexes both models, all tiers, and per-language benchmark numbers.
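Since each bundle is self-contained, a quick completeness check before loading can catch partial downloads. The sketch below only tests that the entries listed in the layout exist; the helper name is hypothetical, and it does not inspect metadata.json or manifest.json, whose schemas are not described here.

```python
from pathlib import Path

# Entries from the folder layout above; .mlmodelc bundles are directories.
REQUIRED_FILES = (
    "preprocessor.mlmodelc",
    "encoder.mlmodelc",
    "decoder.mlmodelc",
    "joint.mlmodelc",
    "decoder_joint.mlmodelc",
    "metadata.json",
    "tokenizer.json",
)


def missing_bundle_files(bundle_dir):
    """Return the required entries that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]


missing = missing_bundle_files("latin/2240ms")
if missing:
    print("incomplete bundle, missing:", ", ".join(missing))
else:
    print("bundle looks complete")
```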

iOS 17

The default latin/ and multilingual/ bundles target iOS 18+ (they use an iOS 18-only quantization op). A parallel ios17/ tree (ios17/latin/<tier>ms/, ios17/multilingual/<tier>ms/) mirrors them for iOS 17, built from the same recipe re-targeted to iOS 17. WER is identical; on devices running iOS 18 the iOS 17 build runs ~4% slower (it uses the older dequant op), which is why both are shipped. Use ios17/ only if you need iOS 17 support.
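Choosing between the two trees by OS version might look like the sketch below. The function is hypothetical, and it assumes nothing below iOS 17 is covered, since the text describes no such build.

```python
def bundle_path(model: str, tier_ms: int, ios_major: int) -> str:
    """Return the bundle path for a given iOS major version."""
    if ios_major < 17:
        raise ValueError("no bundle described for iOS versions below 17")
    # iOS 18+ uses the default tree; iOS 17 uses the mirrored ios17/ tree.
    prefix = "" if ios_major >= 18 else "ios17/"
    return f"{prefix}{model}/{tier_ms}ms"


print(bundle_path("latin", 2240, ios_major=18))
print(bundle_path("latin", 2240, ios_major=17))
```

Preferring the default tree whenever the OS allows it avoids the ~4% slowdown of the iOS 17 build.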

Notes

Latin-script ships are domain-general. The vocab keep-set is defined by the Latin writing system, not derived from any evaluation corpus, so there is no test-set overfit and no out-of-vocabulary loss for any Latin-script text.
zh/ja use the full-vocab model (no pruned keep-set), so they have no OOV limitation and cover the full character inventory — at the cost of throughput below the 2 s tier (use 2 s).
The multilingual full-vocab model (13087) supports 100+ languages via prompt_id — use it when broad coverage matters more than per-language speed.

Benchmark methodology

Apple M5 Pro, macOS 26.5, coremltools 9.0, CoreML iOS18 target, .cpuAndNeuralEngine routing. Single-stream, batch=1, per-file sum-aggregate RTFx (matches the Open ASR Leaderboard convention). All languages evaluated on FLEURS test, full splits. WER for Latin-script languages, CER for zh/ja, via HuggingFace normalization. No inverse text normalization is applied, so FLEURS' digit-bearing utterances inflate WER by ~1–2 pp relative to number-normalized references; FLEURS is also multi-domain, so these numbers run higher than LibriSpeech/MLS would for the same model.
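The difference between WER and CER, and the effect of skipping inverse text normalization, can be seen with a plain edit-distance sketch. It is a generic corpus-level error rate (total edits over total reference units), not the evaluation script behind the table; in particular it applies no text normalization, and the example sentences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(pairs, unit="word"):
    """WER (unit="word") or CER (unit="char") over (reference, hypothesis) pairs."""
    errors = total = 0
    for ref, hyp in pairs:
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            # Characters with spaces removed, for scripts without word boundaries.
            r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references are empty")
    return errors / total


# A digit in the reference versus spelled-out words in the hypothesis:
# semantically the same, yet it counts as 2 edits over 4 reference words.
print(error_rate([("i have 25 apples", "i have twenty five apples")]))
# One missing character out of three.
print(error_rate([("東京都", "東京")], unit="char"))
```

The first example shows how a digit-bearing reference can raise WER even when the transcript is semantically correct.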

License & attribution

Derived from the base model nemotron-3.5-asr-streaming-0.6b, governed by the NVIDIA Software and Model Evaluation License. Weights are quantized/pruned post-training only — no retraining, no fine-tuning, no calibration-data fitting.
