Magpie-TTS-Multilingual-357M-CoreML-8bit

speech-swift — Apple SDK
soniqo.audio — website
blog — blog

Core ML port of NVIDIA Magpie-TTS Multilingual 357M, an autoregressive multi-codebook TTS model over the Nano-Codec 22 kHz / 1.89 kbps / 21.5 fps vocoder, quantized to INT8 weight-only for Apple Silicon.

The bundle targets iOS and macOS. It ships four compiled .mlmodelc packages that use a scatter-based KV cache, which keeps the graph fully static and ANE-friendly.
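To illustrate what a scatter-style cache update means, here is a minimal pure-Python sketch, assuming a preallocated fixed-length cache and a one-hot position mask. The function name `scatter_write` and the toy shapes are hypothetical, and the actual Core ML graph may express the update differently. The point is only that the cache keeps the same shape on every step, which is what allows a static graph.

```python
from typing import List

Row = List[float]


def scatter_write(cache: List[Row], new_row: Row, pos: int) -> List[Row]:
    """Return a cache of identical shape with slot `pos` replaced by `new_row`.

    Hypothetical illustration: a one-hot mask selects the slot, so the update
    is a fixed-shape blend rather than a concatenation that grows the tensor.
    """
    if not 0 <= pos < len(cache):
        raise IndexError("pos is outside the preallocated cache")
    mask = [1.0 if i == pos else 0.0 for i in range(len(cache))]
    return [
        [m * n + (1.0 - m) * c for c, n in zip(row, new_row)]
        for m, row in zip(mask, cache)
    ]


# Preallocate 4 slots of width 2, then write decoding steps 0 and 1.
cache = [[0.0, 0.0] for _ in range(4)]
cache = scatter_write(cache, [1.0, 2.0], pos=0)
cache = scatter_write(cache, [3.0, 4.0], pos=1)
```

A growing cache would instead change its sequence dimension on every step, which forces dynamic shapes in the graph.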

Model


Total parameters	357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings)
Architecture	Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder
Audio	8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps
Languages	EN, ES, DE, FR, IT, VI, ZH, HI, JA
Speakers	5 baked (John, Sofia, Aria, Jason, Leo)
Bundle size	342 MB on disk
Layout	4-bundle Core ML (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder)
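The audio figures in the table are consistent with the codec's nominal bitrate, which a short calculation can confirm. The sketch below assumes each code index costs log2(2024) bits and ignores any entropy coding or container overhead. The constant and function names are illustrative only.

```python
import math

FRAME_RATE_HZ = 21.5   # frames per second, from the table above
NUM_CODEBOOKS = 8      # code indices emitted per frame
CODEBOOK_SIZE = 2024   # entries per codebook

# Bits needed to address one entry in each of the 8 codebooks.
bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)
bitrate_kbps = FRAME_RATE_HZ * bits_per_frame / 1000.0


def frames_for(seconds: float) -> int:
    """Number of decoder frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ)


print(f"{bitrate_kbps:.2f} kbps")  # prints "1.89 kbps"
```

At this frame rate, ten seconds of speech needs `frames_for(10)` = 215 decoder steps, or 1720 code indices in total.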

Files

File	Size	Description
`text_encoder.mlmodelc/`	97 MB	text encoder (INT8)
`decoder_prefill.mlmodelc/`	87 MB	decoder prefill (INT8)
`decoder_step.mlmodelc/`	97 MB	decoder step (INT8)
`nanocodec_decoder.mlmodelc/`	61 MB	nanocodec decoder (FP16)
`manifest.json`	<1 KB	SHA256 + sizes manifest
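Since manifest.json carries SHA256 hashes and sizes, a download can be checked before the models are loaded. The manifest schema is not documented here, so the sketch below assumes a hypothetical layout of `{"files": {"<relative path>": {"sha256": "<hex>", "size": <bytes>}}}`. Adjust the keys to match the real file.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle: Path) -> List[str]:
    """Return a list of problems; an empty list means every entry matched."""
    # ASSUMPTION: the "files" / "sha256" / "size" keys are hypothetical.
    manifest = json.loads((bundle / "manifest.json").read_text())
    problems = []
    for rel_path, expected in manifest["files"].items():
        target = bundle / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif target.stat().st_size != expected["size"]:
            problems.append(f"size mismatch: {rel_path}")
        elif sha256_of(target) != expected["sha256"]:
            problems.append(f"hash mismatch: {rel_path}")
    return problems
```

The size check runs before the hash so that a truncated download is reported without hashing the whole file.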

The 4-bundle layout splits the model into:

text_encoder — runs once per utterance over the phoneme sequence
decoder_prefill — batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
decoder_step — single AR step over the next audio frame; shares weights with decoder_prefill
nanocodec_decoder — codes → 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)
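The division of labour among the four packages can be sketched as a small driver loop. This is not the speech-swift implementation: the function `synthesize`, the four injected callables, the `None` end-of-speech convention, and `max_frames` are all hypothetical stand-ins, since the real tensor names and stopping rule are not given here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = List[int]  # one code index per codebook (8 for this model)


def synthesize(
    tokens: Sequence[int],
    speaker_id: int,
    encode_text: Callable[[Sequence[int]], object],
    prefill: Callable[[object, int], object],
    step: Callable[[object, Optional[Frame]], Tuple[Optional[Frame], object]],
    decode_audio: Callable[[List[Frame]], List[float]],
    max_frames: int = 1000,
) -> List[float]:
    """Run encoder once, prefill once, step per frame, then vocode."""
    encoded = encode_text(tokens)          # text_encoder: once per utterance
    state = prefill(encoded, speaker_id)   # decoder_prefill: speaker context -> KV cache
    frames: List[Frame] = []
    previous: Optional[Frame] = None
    for _ in range(max_frames):            # decoder_step: one audio frame per call
        frame, state = step(state, previous)
        if frame is None:                  # ASSUMED end-of-speech signal
            break
        frames.append(frame)
        previous = frame
    return decode_audio(frames)            # nanocodec_decoder: codes -> waveform
```

Passing the sub-models in as callables keeps the loop independent of how each package is loaded, so the same control flow can be exercised with stubs before wiring in Core ML.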

Round-trip validation

Each language was checked end to end: one held-out sentence was synthesized, transcribed with faster-whisper large-v3, and scored by character error rate (CER) against the input text:

Language	en	es	de	fr	it	vi	zh	hi
CER	0.00%	0.00%	0.00%	0.00%	0.00%	<8% (tone)	<2% (1 added interjection)	mixed-script Whisper artifact
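For reference, CER is conventionally the character-level Levenshtein distance between the reference text and the ASR transcript, divided by the reference length. The sketch below is a minimal standard-library version. The text normalization behind the table (casing, punctuation, script handling) is not stated, so none is applied here, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against a non-empty reference."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(f"{cer('hello', 'hallo'):.2%}")  # prints "20.00%"
```

Because insertions count as errors, a single added interjection in the transcript raises CER even when every reference character is recognized correctly.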

Usage

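import json
from pathlib import Path

import coremltools as ct  # loading .mlmodelc requires macOS
from huggingface_hub import snapshot_download

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Load the 4 sub-models and run the AR loop.
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit"))

# The manifest lists SHA256 hashes and sizes for the bundle files.
manifest = json.loads((bundle / "manifest.json").read_text())

# Each compiled package is a directory; CompiledMLModel takes its path.
names = ["text_encoder", "decoder_prefill", "decoder_step", "nanocodec_decoder"]
models = {n: ct.models.CompiledMLModel(str(bundle / f"{n}.mlmodelc")) for n in names}

# Production usage: see https://github.com/soniqo/speech-swift.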

The production Swift integration handles tokenization, the AR loop, KV-cache management, and audio rendering. This Hugging Face bundle exists for researchers and SDK developers who want to build directly on the Core ML models.

Source

Upstream weights: nvidia/magpie_tts_multilingual_357m (NVIDIA Open Model License)
Codec: nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps
Paper: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference (arXiv 2508.05835, published Aug 7, 2025)

License

NVIDIA Open Model License — inherited from upstream Magpie-TTS Multilingual. Suitable for commercial use; please review the license text linked above.
