Magpie-TTS-Multilingual-357M-CoreML-8bit

Core ML port of NVIDIA Magpie-TTS Multilingual 357M, an autoregressive multi-codebook TTS model over the Nano-Codec 22 kHz / 1.89 kbps / 21.5 fps vocoder, quantized to INT8 weight-only for Apple Silicon.

Core ML INT8 bundle for iOS / macOS. Four compiled .mlmodelc packages with scatter-based KV cache (fully static graph, ANE-friendly).
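As a rough illustration of what a scatter-style cache write looks like, the sketch below keeps a preallocated cache at a fixed shape and writes each new row at an explicit position instead of appending. It is a toy in pure Python under stated assumptions: `scatter_write`, `MAX_LEN`, and `DIM` are hypothetical names, and the one-hot mask blend is only one way to express a scatter. The ops actually used in the exported graphs are not described here.

```python
def scatter_write(cache, new_row, position):
    """Return a cache of unchanged shape with `new_row` written at `position`.

    The write is a one-hot mask blend, so the result always has
    len(cache) rows no matter which position is updated.
    """
    max_len = len(cache)
    if not 0 <= position < max_len:
        raise IndexError("position outside the preallocated cache")
    mask = [1.0 if i == position else 0.0 for i in range(max_len)]
    return [
        [m * new + (1.0 - m) * old for old, new in zip(row, new_row)]
        for row, m in zip(cache, mask)
    ]


MAX_LEN, DIM = 4, 2  # toy sizes, not the model's real dimensions
cache = [[0.0] * DIM for _ in range(MAX_LEN)]
cache = scatter_write(cache, [1.0, 2.0], 0)  # step 0 writes row 0
cache = scatter_write(cache, [3.0, 4.0], 1)  # step 1 writes row 1
print(len(cache), cache[:2])
```

Because the cache never grows, every step sees tensors of the same shape, which is what lets the graph stay fully static.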

Model

Total parameters 357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings)
Architecture Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder
Audio 8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps
Languages EN, ES, DE, FR, IT, VI, ZH, HI, JA
Speakers 5 baked (John, Sofia, Aria, Jason, Leo)
Bundle size 342 MB on disk
Layout 4-bundle Core ML (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder)
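The audio figures above are consistent with the 1.89 kbps quoted for the codec. A quick check, assuming the nominal bitrate counts log2(2024) bits per code index with no entropy coding (an assumption, not something stated in this card):

```python
import math

CODEBOOKS = 8
CODES_PER_CODEBOOK = 2024
FRAMES_PER_SECOND = 21.5
SAMPLE_RATE_HZ = 22_050

# Each frame carries one index per codebook.
bits_per_frame = CODEBOOKS * math.log2(CODES_PER_CODEBOOK)
bitrate_kbps = bits_per_frame * FRAMES_PER_SECOND / 1000
# Waveform samples the codec decoder must produce for each frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAMES_PER_SECOND

print(f"{bits_per_frame:.1f} bits/frame")
print(f"{bitrate_kbps:.2f} kbps")
print(f"{samples_per_frame:.1f} samples/frame")
```

The bitrate comes out at about 1.89 kbps, and each frame covers a little over a thousand samples of 22.05 kHz audio.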

Files

| File | Size | Description |
| --- | --- | --- |
| text_encoder.mlmodelc/ | 97 MB | text encoder (INT8) |
| decoder_prefill.mlmodelc/ | 87 MB | decoder prefill (INT8) |
| decoder_step.mlmodelc/ | 97 MB | decoder step (INT8) |
| nanocodec_decoder.mlmodelc/ | 61 MB | nanocodec decoder (FP16) |
| manifest.json | <1 KB | SHA256 + sizes manifest |
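A download can be checked against manifest.json before loading anything. The sketch below is a minimal example under a loudly assumed schema: the real layout of manifest.json is not documented here, so the `{"files": {path: {"sha256", "size"}}}` shape and the `verify_bundle` and `sha256_of` helpers are hypothetical and would need adapting to the actual file.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_dir, manifest_name="manifest.json"):
    """Return a list of problems; an empty list means every entry matched.

    ASSUMED schema (hypothetical):
    {"files": {"<relative path>": {"sha256": "<hex>", "size": <bytes>}}}
    """
    bundle_dir = Path(bundle_dir)
    manifest = json.loads((bundle_dir / manifest_name).read_text())
    problems = []
    for rel_path, expected in manifest["files"].items():
        target = bundle_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif target.stat().st_size != expected["size"]:
            problems.append(f"size mismatch: {rel_path}")
        elif sha256_of(target) != expected["sha256"]:
            problems.append(f"hash mismatch: {rel_path}")
    return problems
```

Checking the size before hashing makes truncated downloads fail fast without reading the whole file.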

The 4-bundle layout splits the model into:

  • text_encoder: runs once per utterance over the phoneme sequence
  • decoder_prefill: batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
  • decoder_step: single AR step over the next audio frame; shares weights with decoder_prefill
  • nanocodec_decoder: codes → 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)
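The control flow implied by this split can be sketched as follows. This is an illustration only: the four callables are hypothetical stand-ins for the Core ML models, their signatures and the `finished` flag are assumptions, and the real input and output names of the bundles are not given here. The stubs exist solely so the sketch runs.

```python
def synthesize(phoneme_ids, text_encoder, decoder_prefill, decoder_step,
               nanocodec_decoder, max_frames=500):
    """Orchestrate the four stages; every callable is a hypothetical stand-in."""
    text_states = text_encoder(phoneme_ids)   # once per utterance
    state = decoder_prefill(text_states)      # speaker context in one batch
    frames = []
    for _ in range(max_frames):               # hard cap guards against runaway loops
        codes, state, finished = decoder_step(text_states, state)
        if finished:
            break
        frames.append(codes)                  # one code per codebook for this frame
    return nanocodec_decoder(frames)          # all frames -> waveform


# Stub models (NOT the real ones): the decoder "ends" after three frames.
def fake_text_encoder(ids):
    return list(ids)

def fake_prefill(text_states):
    return 0                                  # opaque state; here a step counter

def fake_step(text_states, state):
    return [state] * 8, state + 1, state >= 3

def fake_codec(frames):
    return frames                             # a real codec would return audio samples

result = synthesize([1, 2, 3], fake_text_encoder, fake_prefill, fake_step, fake_codec)
print(len(result), "frames generated")
```

Keeping the decoder state opaque to the loop means the KV-cache layout can change without touching the orchestration code.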

Round-trip validation

End-to-end TTS → faster-whisper large-v3 ASR on one held-out sentence per language (Character Error Rate):

| Language | en | es | de | fr | it | vi | zh | hi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CER | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | <8% (tone) | <2% (1 added interjection) | mixed-script Whisper artifact |
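For reference, Character Error Rate is the character-level edit distance between the reference text and the ASR transcript, divided by the reference length. A minimal pure-Python sketch follows; the text normalization used for the numbers above (casing, punctuation, whitespace) is not stated, so none is applied here, and the helper names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance over characters, computed one row at a time."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(f"{cer('hello world', 'hello world'):.2%}")
```

Because insertions count as errors, a single added word in a short sentence (as in the zh result) is enough to move CER off zero.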

Usage

from pathlib import Path

from huggingface_hub import snapshot_download

# Download the bundle (or reuse the locally cached copy).
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit"))

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Load the 4 sub-models from `bundle` and run the AR loop.

# Production usage: see https://github.com/soniqo/speech-swift.

The production Swift integration handles tokenization, the AR loop, KV-cache management, and audio rendering. This Hugging Face bundle exists for researchers and SDK developers who want to build directly on the Core ML models.

Source

License

NVIDIA Open Model License, inherited from upstream Magpie-TTS Multilingual. Suitable for commercial use; please review the license text linked above.
