Magpie-TTS-Multilingual-357M-MLX-8bit

MLX port of NVIDIA Magpie-TTS Multilingual 357M, an autoregressive multi-codebook TTS model over the Nano-Codec 22 kHz / 1.89 kbps / 21.5 fps vocoder, quantized to INT8 weight-only for Apple Silicon.

Perceptually transparent quality at ~57% of the FP16 size. Use for fidelity-sensitive deployments.
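
The ~57% figure is roughly what group-wise weight-only quantization predicts. The following is a back-of-the-envelope sketch, not a statement of how this bundle was produced: it assumes MLX's default group size of 64 with one FP16 scale and one FP16 bias per group, and `quantized_fraction` is a hypothetical knob for the share of weights that were actually quantized.

```python
def effective_bits(bits: int = 8, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Average stored bits per quantized weight, including the per-group scale and bias."""
    return bits + overhead_bits / group_size


def size_ratio_vs_fp16(quantized_fraction: float, bits: int = 8, group_size: int = 64) -> float:
    """Size relative to FP16 when only `quantized_fraction` of the weights is quantized."""
    quantized_ratio = effective_bits(bits, group_size) / 16.0
    return quantized_fraction * quantized_ratio + (1.0 - quantized_fraction)


print(effective_bits())                     # 8.5
print(round(size_ratio_vs_fp16(1.0), 3))    # 0.531
print(round(size_ratio_vs_fp16(0.9), 3))    # 0.578
```

Under these assumptions, a ratio in the high 50s is consistent with a small share of tensors (for example embeddings or norms) staying in FP16.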

Model

| Property | Value |
| --- | --- |
| Total parameters | 357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings) |
| Architecture | Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder |
| Audio | 8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps |
| Languages | EN, ES, DE, FR, IT, VI, ZH, HI, JA |
| Speakers | 5 baked (John, Sofia, Aria, Jason, Leo) |
| Bundle size | 410 MB on disk |
| Layout | 4-bundle MLX (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder) |
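
The codec figures quoted above are mutually consistent: 8 codebooks of 2024 codes at 21.5 frames per second work out to about 1.89 kbps. A short check of that arithmetic (the function name is illustrative only):

```python
import math


def codec_bitrate_bps(codebooks: int, codes_per_book: int, frames_per_second: float) -> float:
    """Bitrate of a multi-codebook codec: bits per frame times frames per second."""
    bits_per_frame = codebooks * math.log2(codes_per_book)
    return bits_per_frame * frames_per_second


bps = codec_bitrate_bps(codebooks=8, codes_per_book=2024, frames_per_second=21.5)
print(f"{bps / 1000:.2f} kbps")  # 1.89 kbps
```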

Files

| File | Size | Description |
| --- | --- | --- |
| text_encoder/model.safetensors | 108.9 MB | text encoder weights (INT8) |
| decoder_prefill/model.safetensors | 129.4 MB | decoder prefill weights (INT8) |
| decoder_step/model.safetensors | 129.4 MB | decoder step weights (INT8) |
| nanocodec_decoder/model.safetensors | 63.2 MB | nanocodec decoder weights (INT8) |
| tokenizer/*.json | ~30 KB each | per-language tokenizer config (8 langs + manifest) |
| manifest.json | <1 KB | SHA256 + sizes manifest |
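
Because manifest.json carries SHA256 digests and sizes, a downloaded bundle can be checked before any weights are loaded. The manifest schema is not documented here, so this sketch assumes a hypothetical layout of `{"files": {"<relative path>": {"sha256": "...", "size": N}}}`; adjust the key names to match the real file.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return a list of problems; an empty list means every listed file matched.

    ASSUMPTION: the manifest maps relative paths to {"sha256": ..., "size": ...}
    under a top-level "files" key. This schema is hypothetical.
    """
    manifest = json.loads((bundle / manifest_name).read_text())
    problems = []
    for rel_path, expected in manifest["files"].items():
        file_path = bundle / rel_path
        if not file_path.is_file():
            problems.append(f"missing: {rel_path}")
        elif file_path.stat().st_size != expected["size"]:
            problems.append(f"size mismatch: {rel_path}")
        elif sha256_of(file_path) != expected["sha256"]:
            problems.append(f"sha256 mismatch: {rel_path}")
    return problems
```

The size comparison runs before hashing so that truncated downloads are reported without reading the whole file.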

The 4-bundle layout splits the model into:

  • text_encoder: runs once per utterance over the phoneme sequence
  • decoder_prefill: batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
  • decoder_step: single AR step over the next audio frame; shares weights with decoder_prefill
  • nanocodec_decoder: codes → 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)
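
The division of labor among the four bundles can be sketched as a control-flow skeleton. Everything in it is hypothetical: the four callables stand in for the loaded sub-models, and their signatures, the `None` end-of-speech convention, and `max_frames` are assumptions for illustration, not the speech-swift API.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Tuple[int, ...]  # one audio frame = one code per codebook (8 in this model)


def synthesize(
    phonemes: Sequence[int],
    speaker: str,
    text_encoder: Callable[[Sequence[int]], object],
    decoder_prefill: Callable[[object, str], object],
    decoder_step: Callable[[object, object], Tuple[Optional[Frame], object]],
    nanocodec_decoder: Callable[[List[Frame]], List[float]],
    max_frames: int = 2000,
) -> List[float]:
    """Hypothetical driver: encode once, prefill once, step until end-of-speech."""
    encoded = text_encoder(phonemes)               # runs once per utterance
    kv_cache = decoder_prefill(encoded, speaker)   # speaker context enters the KV cache in one batch
    frames: List[Frame] = []
    for _ in range(max_frames):                    # hard cap in case end-of-speech never arrives
        frame, kv_cache = decoder_step(encoded, kv_cache)  # one AR step yields one audio frame
        if frame is None:                          # assumed end-of-speech signal
            break
        frames.append(frame)
    return nanocodec_decoder(frames)               # codes -> waveform samples
```

At 21.5 fps, the assumed `max_frames=2000` cap corresponds to roughly 93 seconds of audio.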

Round-trip validation

End-to-end TTS → faster-whisper large-v3 ASR on a held-out sentence per language (Character Error Rate):

| Language | en | es | de | fr | it | vi | zh | hi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CER | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | <8% (tone) | <2% (1 added interjection) | mixed-script Whisper artifact |
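
For reference, CER is the character-level Levenshtein distance between the reference text and the ASR transcript, divided by the reference length. A minimal sketch follows; the card does not say what text normalization (casing, punctuation) was applied before scoring, so none is applied here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(f"{cer('hello world', 'hello world'):.2%}")  # 0.00%
print(f"{cer('hello world', 'hello word'):.2%}")   # 9.09%
```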

Usage

import json
from pathlib import Path
import mlx.core as mx

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Load the 3 sub-models and run the AR loop.
from huggingface_hub import snapshot_download
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-MLX-8bit"))

# Production usage: see https://github.com/soniqo/speech-swift.

The production Swift integration handles tokenization, the AR loop, KV-cache management, and audio rendering. This HuggingFace bundle exists for researchers and SDK developers building atop the MLX weights directly.
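
When working on the weights directly, a useful first step is to list what each file contains. A safetensors file begins with an 8-byte little-endian length followed by that many bytes of JSON header, so it can be inspected with the standard library alone. The tensor names in this bundle are not documented here; the sketch simply prints whatever it finds.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path: Path) -> dict:
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} without loading any weights."""
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # unsigned 64-bit, little-endian
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(path: Path) -> None:
    """Print one line per tensor: name, dtype, shape."""
    for name, info in sorted(read_safetensors_header(path).items()):
        print(f"{name:60s} {info['dtype']:>5s} {info['shape']}")
```

For example, `summarize(bundle / "decoder_step" / "model.safetensors")`, with `bundle` being the snapshot path from the usage snippet above.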

Source

License

NVIDIA Open Model License, inherited from upstream Magpie-TTS Multilingual. Suitable for commercial use; please review the license text linked above.
