Magpie-TTS-Multilingual-357M-CoreML-8bit
- speech-swift: Apple SDK
- soniqo.audio: website
- blog: blog
Core ML port of NVIDIA Magpie-TTS Multilingual 357M, an autoregressive multi-codebook TTS model over the Nano-Codec 22 kHz / 1.89 kbps / 21.5 fps vocoder, quantized to INT8 weight-only for Apple Silicon.
Core ML INT8 bundle for iOS / macOS. Four compiled .mlmodelc packages with scatter-based KV cache (fully static graph, ANE-friendly).
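To illustrate what a scatter-style cache update can look like, here is a minimal pure-Python sketch. It assumes a preallocated cache of fixed length and a one-hot write mask. The names `MAX_LEN`, `DIM`, `new_cache`, and `scatter_write` and the tiny dimensions are hypothetical, and the Core ML graph in this bundle may be built differently.

```python
# Illustrative only: a fixed-shape KV cache updated by a masked (scatter-style)
# write, so the cache never grows and every step sees the same shapes.
MAX_LEN = 6  # hypothetical cache length
DIM = 2      # hypothetical per-position feature size


def new_cache():
    """Preallocate a zero-filled cache of shape [MAX_LEN, DIM]."""
    return [[0.0] * DIM for _ in range(MAX_LEN)]


def scatter_write(cache, pos, value):
    """Return a cache of the same shape with row `pos` replaced by `value`.

    The write is phrased as mask * value + (1 - mask) * cache, which avoids
    any dynamic-shape concatenation.
    """
    if not 0 <= pos < len(cache):
        raise IndexError("cache position out of range")
    out = []
    for i, row in enumerate(cache):
        mask = 1.0 if i == pos else 0.0
        out.append([mask * v + (1.0 - mask) * r for v, r in zip(value, row)])
    return out


cache = new_cache()
cache = scatter_write(cache, 0, [1.0, 2.0])  # first decoding step
cache = scatter_write(cache, 1, [3.0, 4.0])  # second decoding step
```

Because the cache keeps the same shape at every step, the step model can be compiled once with static input and output shapes.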
Model
| Property | Value |
|---|---|
| Total parameters | 357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings) |
| Architecture | Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder |
| Audio | 8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps |
| Languages | EN, ES, DE, FR, IT, VI, ZH, HI, JA |
| Speakers | 5 baked (John, Sofia, Aria, Jason, Leo) |
| Bundle size | 342 MB on disk |
| Layout | 4-bundle Core ML (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder) |
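As a quick sanity check, the codebook count, codebook size, and frame rate in the table can be combined into a bitrate and compared with the 1.89 kbps in the codec name. The sketch assumes each code costs log2(codes) bits with no further entropy coding, which is an assumption and not something the model card states.

```python
import math

CODEBOOKS = 8          # from the Audio row
CODES_PER_BOOK = 2024  # from the Audio row
FRAMES_PER_SEC = 21.5  # from the Audio row

bits_per_code = math.log2(CODES_PER_BOOK)      # bits needed to index one codebook
bits_per_frame = CODEBOOKS * bits_per_code     # all codebooks for one frame
bits_per_second = bits_per_frame * FRAMES_PER_SEC

print(f"{bits_per_second / 1000:.2f} kbps")    # prints "1.89 kbps"
```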
Files
| File | Size | Description |
|---|---|---|
| text_encoder.mlmodelc/ | 97 MB | text encoder (INT8) |
| decoder_prefill.mlmodelc/ | 87 MB | decoder prefill (INT8) |
| decoder_step.mlmodelc/ | 97 MB | decoder step (INT8) |
| nanocodec_decoder.mlmodelc/ | 61 MB | nanocodec decoder (FP16) |
| manifest.json | <1 KB | SHA256 + sizes manifest |
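After downloading, it can be useful to confirm that every entry listed above is present and to measure its size on disk. The sketch below does only that. It does not parse `manifest.json`, because the manifest schema is not documented here. The helper names `sha256_file` and `inventory` are hypothetical.

```python
import hashlib
from pathlib import Path

# Entry names taken from the Files table.
EXPECTED = [
    "text_encoder.mlmodelc",
    "decoder_prefill.mlmodelc",
    "decoder_step.mlmodelc",
    "nanocodec_decoder.mlmodelc",
    "manifest.json",
]


def sha256_file(path, chunk_size=1 << 20):
    """Hash one file in chunks so large weight files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(bundle_root):
    """Map each expected entry to its total size in bytes, or None if missing."""
    root = Path(bundle_root)
    sizes = {}
    for name in EXPECTED:
        entry = root / name
        if entry.is_dir():
            sizes[name] = sum(p.stat().st_size for p in entry.rglob("*") if p.is_file())
        elif entry.is_file():
            sizes[name] = entry.stat().st_size
        else:
            sizes[name] = None
    return sizes
```

The digests from `sha256_file` can then be compared by hand against whatever `manifest.json` records.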
The 4-bundle layout splits the model into:
- text_encoder: runs once per utterance over the phoneme sequence
- decoder_prefill: batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
- decoder_step: single AR step over the next audio frame; shares weights with decoder_prefill
- nanocodec_decoder: codes → 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)
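The order in which the four models run can be sketched as a small driver function. This is a control-flow outline only. Every callable is a caller-supplied stand-in for the matching Core ML model, and the signatures, the `done` flag, and `max_frames` are hypothetical rather than taken from the bundle.

```python
def synthesize(phonemes, text_encoder, decoder_prefill, decoder_step,
               nanocodec_decoder, max_frames=200):
    """Run the four stages in the order described above.

    All four callables are hypothetical wrappers around the Core ML models.
    """
    encoded = text_encoder(phonemes)   # once per utterance
    state = decoder_prefill(encoded)   # one batched pass over the speaker context
    frames = []
    for _ in range(max_frames):
        # One autoregressive step yields the codes for the next audio frame.
        frame, state, done = decoder_step(encoded, state)
        if done:                       # hypothetical end-of-speech signal
            break
        frames.append(frame)
    return nanocodec_decoder(frames)   # codes -> waveform
```

Only `decoder_step` sits inside the loop, so the encoder, the prefill pass, and the codec each run once per utterance in this outline.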
Round-trip validation
End-to-end TTS → faster-whisper large-v3 ASR on a held-out sentence per language (Character Error Rate):
| Language | en | es | de | fr | it | vi | zh | hi |
|---|---|---|---|---|---|---|---|---|
| CER | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | <8% (tone) | <2% (1 added interjection) | mixed-script Whisper artifact |
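For reference, Character Error Rate is conventionally the character-level edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below implements that convention. The text normalization used for the table above is not stated, so this is not necessarily the exact scoring procedure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, an extra word in the transcript, such as the added interjection noted for zh, counts as insertions and raises the CER even when every reference character is recognized.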
Usage
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Download the bundle, then load the 4 sub-models and run the AR loop.
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit"))
# manifest.json holds the SHA256 digests and sizes of the bundle files.
manifest = json.loads((bundle / "manifest.json").read_text())
# Production usage: see https://github.com/soniqo/speech-swift.
The production Swift integration handles tokenization, the AR loop, KV-cache management, and audio rendering. This Hugging Face bundle exists for researchers and SDK developers who want to build directly on the Core ML models.
Source
- Upstream weights: nvidia/magpie_tts_multilingual_357m (NVIDIA Open Model License)
- Codec: nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps
- Paper: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
License
NVIDIA Open Model License, inherited from upstream Magpie-TTS Multilingual. Suitable for commercial use; please review the license text linked above.