Magpie-TTS-Multilingual-357M-MLX-8bit
- speech-swift: Apple SDK
- soniqo.audio: website
- blog
MLX port of NVIDIA Magpie-TTS Multilingual 357M, an autoregressive multi-codebook TTS model over the Nano-Codec 22 kHz / 1.89 kbps / 21.5 fps vocoder, quantized to INT8 weight-only for Apple Silicon.
Quality is transparent relative to FP16 at ~57% of the size. Use it for fidelity-sensitive deployments.
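
As a rough sanity check on the size figure, the sketch below computes the per-tensor footprint of group-wise weight-only quantization relative to FP16. It assumes a group size of 64 with one FP16 scale and one FP16 bias per group; the settings actually used for this bundle are not stated here, and `quantized_fraction` is an illustrative name.

```python
def quantized_fraction(
    bits: int = 8,          # bits per quantized weight
    group_size: int = 64,   # ASSUMPTION: weights per quantization group
    meta_bits: int = 16,    # ASSUMPTION: FP16 scale and FP16 bias per group
    fp_bits: int = 16,      # bits per weight in the FP16 baseline
) -> float:
    """Size of one group-quantized tensor relative to its FP16 original."""
    # Each group stores a scale and a bias next to the packed weights.
    bits_per_weight = bits + 2 * meta_bits / group_size
    return bits_per_weight / fp_bits


print(f"{quantized_fraction():.1%} of FP16 per quantized tensor")
```

Under these assumptions a fully quantized tensor lands a little above half its FP16 size. Any tensors kept at higher precision would push the bundle-level ratio above that per-tensor figure.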
Model
| Property | Value |
|---|---|
| Total parameters | 357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings) |
| Architecture | Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder |
| Audio | 8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps |
| Languages | EN, ES, DE, FR, IT, VI, ZH, HI, JA |
| Speakers | 5 baked (John, Sofia, Aria, Jason, Leo) |
| Bundle size | 410 MB on disk |
| Layout | 4-bundle MLX (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder) |
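
The audio figures above imply a fixed token budget per second of speech. A small sketch of that arithmetic, using only the numbers quoted in this card (the function and constant names are illustrative, not part of any SDK):

```python
import math

FRAME_RATE_FPS = 21.5     # codec frames per second
NUM_CODEBOOKS = 8         # codes emitted per frame
SAMPLE_RATE_HZ = 22_050   # output sample rate
BITRATE_BPS = 1_890       # 1.89 kbps codec bitrate


def frames_for_duration(seconds: float) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_FPS)


codes_per_second = FRAME_RATE_FPS * NUM_CODEBOOKS     # codes the AR loop emits per second
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_FPS   # waveform samples per codec frame
bits_per_code = BITRATE_BPS / codes_per_second        # information carried by one code

print(frames_for_duration(10.0), codes_per_second)
print(round(samples_per_frame, 1), round(bits_per_code, 2))
```

Each decoder step therefore accounts for roughly a thousand output samples, which is why the step count of the autoregressive loop, rather than the vocoder, tends to dominate latency planning.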
Files
| File | Size | Description |
|---|---|---|
| text_encoder/model.safetensors | 108.9 MB | text encoder weights (INT8) |
| decoder_prefill/model.safetensors | 129.4 MB | decoder prefill weights (INT8) |
| decoder_step/model.safetensors | 129.4 MB | decoder step weights (INT8) |
| nanocodec_decoder/model.safetensors | 63.2 MB | nanocodec decoder weights (INT8) |
| tokenizer/*.json | ~30 KB each | per-language tokenizer config (8 langs + manifest) |
| manifest.json | <1 KB | SHA256 + sizes manifest |
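
Since the bundle ships a SHA256 + sizes manifest, a download can be checked before loading. The sketch below shows one way to do that. The manifest schema is an assumption (the key names `files`, `sha256`, and `size` are hypothetical), so adjust them to match the real `manifest.json`.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weights never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return relative paths whose size or hash does not match the manifest.

    ASSUMED schema (hypothetical, not confirmed by the model card):
    {"files": {"<relative path>": {"sha256": "<hex>", "size": <bytes>}}}
    """
    manifest = json.loads((bundle / manifest_name).read_text())
    mismatched = []
    for rel_path, expected in manifest["files"].items():
        target = bundle / rel_path
        if (
            not target.is_file()
            or target.stat().st_size != expected["size"]   # cheap check first
            or sha256_of(target) != expected["sha256"]     # full hash last
        ):
            mismatched.append(rel_path)
    return mismatched
```

An empty return value means every listed file is present and intact. The size comparison runs before the hash so truncated downloads are rejected without reading the whole file.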
The 4-bundle layout splits the model into:
- text_encoder: runs once per utterance over the phoneme sequence
- decoder_prefill: batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
- decoder_step: single AR step over the next audio frame; shares weights with decoder_prefill
- nanocodec_decoder: codes → 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)
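
The division of labor above can be sketched as a driver loop. The four callables are hypothetical stand-ins for the sub-models; their signatures, the stop flag, and the `max_frames` cap are assumptions for illustration, not the speech-swift API.

```python
from typing import Callable, List, Sequence

Frame = List[int]  # one audio frame: one code per codebook


def synthesize(
    phonemes: Sequence[int],
    encode_text: Callable,    # stands in for text_encoder
    prefill: Callable,        # stands in for decoder_prefill
    step: Callable,           # stands in for decoder_step
    decode_audio: Callable,   # stands in for nanocodec_decoder
    max_frames: int = 500,    # arbitrary safety cap (assumption)
):
    """Hypothetical driver showing the call order of the four bundles."""
    memory = encode_text(phonemes)   # once per utterance
    cache = prefill(memory)          # speaker context into the KV cache, in one batch
    frames: List[Frame] = []
    for _ in range(max_frames):      # autoregressive loop: one frame per step
        previous = frames[-1] if frames else None
        frame, cache, done = step(memory, cache, previous)
        if done:                     # the step signalled end of speech
            break
        frames.append(frame)
    return decode_audio(frames)      # codes to waveform
```

Only `step` runs inside the loop, so the encoder and prefill costs are paid once per utterance while per-frame cost stays constant.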
Round-trip validation
End-to-end TTS → faster-whisper large-v3 ASR on one held-out sentence per language (Character Error Rate):
| Language | en | es | de | fr | it | vi | zh | hi |
|---|---|---|---|---|---|---|---|---|
| CER | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | <8% (tone) | <2% (1 added interjection) | mixed-script Whisper artifact |
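
For reference, Character Error Rate is the character-level edit distance between the reference text and the ASR transcript, divided by the reference length. A minimal sketch of that metric follows. The text normalization used for the numbers above is not specified here, so this version applies none.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion from the reference
                curr[j - 1] + 1,          # insertion into the reference
                prev[j - 1] + (r != h),   # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, a single added interjection in the transcript raises CER even when every reference character is recognized correctly.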
Usage
```python
import json
from pathlib import Path

import mlx.core as mx
from huggingface_hub import snapshot_download

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Load the 3 sub-models and run the AR loop.
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-MLX-8bit"))

# Production usage: see https://github.com/soniqo/speech-swift.
```
The production Swift integration handles tokenization, the AR loop, KV-cache management, and audio rendering. This Hugging Face bundle exists for researchers and SDK developers building directly on the MLX weights.
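
For readers working with the weights directly, the sketch below locates each sub-model's weights file inside a downloaded bundle and loads it with `mx.load`, which returns a name-to-array dictionary for safetensors files. The directory names come from the Files table. The helper names are illustrative, and the tensor names inside each file are not documented here.

```python
from pathlib import Path

SUBMODELS = ("text_encoder", "decoder_prefill", "decoder_step", "nanocodec_decoder")


def weights_path(bundle: Path, submodel: str) -> Path:
    """Path to one sub-model's weights inside the downloaded bundle."""
    if submodel not in SUBMODELS:
        raise ValueError(f"unknown sub-model: {submodel!r}")
    return bundle / submodel / "model.safetensors"


def load_weights(bundle: Path, submodel: str) -> dict:
    """Load one sub-model's tensors as a name -> mx.array dictionary."""
    import mlx.core as mx  # imported lazily so the path helper works without MLX
    return mx.load(str(weights_path(bundle, submodel)))
```

Calling `load_weights(bundle, "text_encoder")` with the `bundle` path returned by `snapshot_download` yields the raw tensors. Building the modules around them and running the autoregressive loop is left to the caller.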
Source
- Upstream weights: nvidia/magpie_tts_multilingual_357m (NVIDIA Open Model License)
- Codec: nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps
- Paper: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
License
NVIDIA Open Model License, inherited from upstream Magpie-TTS Multilingual. Suitable for commercial use; please review the license text linked above.