Magpie-TTS-Multilingual-357M-MLX-8bit

speech-swift — Apple SDK
soniqo.audio — website
blog — blog

MLX port of NVIDIA Magpie-TTS Multilingual 357M, an autoregressive multi-codebook TTS model over the Nano-Codec 22 kHz / 1.89 kbps / 21.5 fps vocoder, quantized to INT8 weight-only for Apple Silicon.

Perceptually transparent quality at ~57% of the FP16 size. Use it for fidelity-sensitive deployments.
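As a rough sanity check on the size figure above, the arithmetic below estimates how large a group-wise INT8 tensor is relative to FP16. The group size of 64 and the FP16 scale plus FP16 bias per group are assumptions chosen to mirror MLX's default group-wise affine layout; the settings actually used for this bundle are not stated here.

```python
def quantized_fraction(bits: int = 8,
                       group_size: int = 64,
                       meta_bits_per_group: int = 32,
                       reference_bits: int = 16) -> float:
    """Size of a group-quantized tensor relative to a reference-precision one.

    ASSUMPTION: each group of `group_size` weights stores one FP16 scale and
    one FP16 bias (32 bits of metadata) next to the packed integer weights.
    """
    bits_per_weight = bits + meta_bits_per_group / group_size
    return bits_per_weight / reference_bits


ratio = quantized_fraction()
print(f"INT8 tensor is about {ratio:.1%} of its FP16 size")
```

Under these assumptions a fully quantized tensor costs 8.5 bits per weight, a little over half of FP16. A bundle-level figure somewhat above that would be consistent with some tensors being kept at higher precision, though that breakdown is not given here.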

Model


Total parameters	357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings)
Architecture	Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder
Audio	8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps
Languages	EN, ES, DE, FR, IT, VI, ZH, HI, JA
Speakers	5 baked (John, Sofia, Aria, Jason, Leo)
Bundle size	410 MB on disk
Layout	4-bundle MLX (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder)
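
The audio figures in the table are mutually consistent, which the short calculation below illustrates. It uses only the numbers listed above and treats every code as equally likely, so the result is an upper bound on the information rate rather than a measured bitrate.

```python
import math

codebooks = 8
codes_per_codebook = 2024
frames_per_second = 21.5
sample_rate_hz = 22_050

# Each frame carries one index per codebook; an index needs log2(codes) bits.
bits_per_frame = codebooks * math.log2(codes_per_codebook)
bitrate_kbps = bits_per_frame * frames_per_second / 1000

# Number of waveform samples the codec decoder emits per frame (approximate,
# since 21.5 fps is itself a rounded figure).
samples_per_frame = sample_rate_hz / frames_per_second

print(f"{bits_per_frame:.1f} bits/frame, {bitrate_kbps:.2f} kbps")
print(f"~{samples_per_frame:.0f} samples per frame")
```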

Files

File	Size	Description
`text_encoder/model.safetensors`	108.9 MB	text encoder weights (INT8)
`decoder_prefill/model.safetensors`	129.4 MB	decoder prefill weights (INT8)
`decoder_step/model.safetensors`	129.4 MB	decoder step weights (INT8)
`nanocodec_decoder/model.safetensors`	63.2 MB	nanocodec decoder weights (INT8)
`tokenizer/*.json`	~30 KB each	per-language tokenizer config (8 langs + manifest)
`manifest.json`	<1 KB	SHA256 + sizes manifest
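
Since `manifest.json` records SHA256 digests and sizes, a downloaded bundle can be checked before loading. The sketch below is a minimal illustration only: the manifest schema it reads (a top-level `files` object mapping relative paths to `sha256` and `size`) is an assumption, as are the helper names `sha256_of` and `verify_bundle`. Adjust the key names to whatever the real manifest contains.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle: Path) -> list[str]:
    """Return the relative paths that are missing or differ from the manifest.

    ASSUMPTION (hypothetical schema):
    {"files": {"<relative path>": {"sha256": "<hex>", "size": <bytes>}}}
    """
    manifest = json.loads((bundle / "manifest.json").read_text())
    mismatches = []
    for rel_path, expected in manifest["files"].items():
        target = bundle / rel_path
        if (not target.is_file()
                or target.stat().st_size != expected["size"]
                or sha256_of(target) != expected["sha256"]):
            mismatches.append(rel_path)
    return mismatches
```

An empty list from `verify_bundle(bundle)` means every listed file matched; the size comparison runs first so that truncated downloads are rejected without hashing.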

The 4-bundle layout splits the model into:

text_encoder — runs once per utterance over the phoneme sequence
decoder_prefill — batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
decoder_step — single AR step over the next audio frame; shares weights with decoder_prefill
nanocodec_decoder — codes → 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)
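
The division of labour between the four bundles can be sketched as a plain control-flow skeleton. Everything here is hypothetical: the callables stand in for the real sub-models, and their signatures, the `is_end` stop test, and the `max_frames` cap are assumptions made for illustration, not the speech-swift API.

```python
from typing import Any, Callable, List, Sequence, Tuple

Frame = List[int]  # one code index per codebook for a single audio frame


def synthesize(
    phonemes: Sequence[int],
    text_encoder: Callable[[Sequence[int]], Any],
    decoder_prefill: Callable[[Any], Any],
    decoder_step: Callable[[Any, Any], Tuple[Frame, Any]],
    nanocodec_decoder: Callable[[List[Frame]], Any],
    is_end: Callable[[Frame], bool],
    max_frames: int = 2000,
) -> Any:
    """Run the four stages in order; all callables are hypothetical stand-ins."""
    encoded = text_encoder(phonemes)   # once per utterance
    cache = decoder_prefill(encoded)   # once: speaker context goes into the KV cache

    frames: List[Frame] = []
    for _ in range(max_frames):        # hard cap guards against a missed stop signal
        frame, cache = decoder_step(encoded, cache)  # one audio frame per step
        if is_end(frame):
            break
        frames.append(frame)

    return nanocodec_decoder(frames)   # codes -> waveform
```

Only `decoder_step` runs inside the loop, which is why separating it from the one-off prefill pass matters for latency.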

Round-trip validation

End-to-end TTS → faster-whisper large-v3 ASR on a held-out sentence per language (Character Error Rate):

Language	en	es	de	fr	it	vi	zh	hi
CER	0.00%	0.00%	0.00%	0.00%	0.00%	<8% (tone)	<2% (1 added interjection)	mixed-script Whisper artifact
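
For reference, Character Error Rate is the character-level edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below is a minimal pure-Python version; it applies no text normalization (casing, punctuation, script conversion), which is an assumption, since the normalization used for the table above is not stated.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate of `hyp` against a non-empty reference `ref`."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Because a single inserted word adds several character edits, one extra interjection in a short sentence is enough to produce a CER of a percent or two.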

Usage

import json
from pathlib import Path

import mlx.core as mx
from huggingface_hub import snapshot_download

# Download the bundle from the Hugging Face Hub (reuses the local cache if present).
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-MLX-8bit"))

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Load the sub-models from `bundle` and run the AR loop.

# Production usage: see https://github.com/soniqo/speech-swift.

The production Swift integration handles tokenization, the AR loop, KV-cache management, and audio rendering. This HuggingFace bundle exists for researchers and SDK developers building atop the MLX weights directly.

Source

Upstream weights: nvidia/magpie_tts_multilingual_357m (NVIDIA Open Model License)
Codec: nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps
Paper: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

License

NVIDIA Open Model License — inherited from upstream Magpie-TTS Multilingual. Suitable for commercial use; please review the license text linked above.