zonos2-mlx: ready-to-run MLX weights
Pre-converted, pre-quantized MLX weights for Zyphra's ZONOS2, an 8B-parameter Mixture-of-Experts autoregressive text-to-speech model, running natively on Apple Silicon.
Download and run. No PyTorch in the inference path, no conversion step.
- Model: 16-expert top-1 MoE AR trunk (layer 26 routes top-2) → DAC 44.1 kHz neural codec for the waveform, with an ECAPA-TDNN speaker encoder (+ LDA) for voice cloning from a short reference clip.
- Runtime: sb1992/mlx-zonos2, a clean-room MLX reimplementation of the inference runtime, gated per-stage against the original PyTorch model.
- This repo: the weights only. Three precision tiers, each a self-contained folder.
Tiers
Each folder (bf16/, int8/, int4/) is self-contained: it bundles the trunk at that tier's
precision plus the (tier-independent) DAC codec and ECAPA speaker encoder, so you download
one folder and it just runs.
| Folder | What's quantized | Folder size | Peak RAM | Target Macs |
|---|---|---|---|---|
| bf16/ | nothing (reference) | ~14 GB | ~44 GB | 64 GB |
| int8/ | attention/FFN/lm_head + experts int8; router/embeddings/norms bf16 | ~7.9 GB | ~13 GB | 32 GB |
| int4/ | attention/FFN/lm_head int8; experts gate/up int4, down int8; router/embeddings/norms bf16 | ~5.7 GB | ~10.6 GB | 16 GB |
Folder size includes the bundled ~315 MB DAC codec + ECAPA speaker encoder (identical across tiers; Hugging Face Xet de-dups them, so they cost storage only once).
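As a rough illustration, the table can be read as a lookup from a Mac's unified memory to a folder. The sketch below copies the peak-RAM and target-Mac figures from the table; the `pick_tier` helper and its "highest precision that fits" rule are assumptions made for the example, not part of the runtime.

```python
# Figures copied from the tier table; the helper itself is hypothetical.
TIERS = [
    # (folder, approx. peak RAM in GB, target Mac unified memory in GB)
    ("bf16", 44.0, 64),
    ("int8", 13.0, 32),
    ("int4", 10.6, 16),
]


def pick_tier(mac_ram_gb: float) -> str:
    """Return the first (highest-precision) tier whose target Mac size fits."""
    for folder, _peak_gb, target_gb in TIERS:
        if mac_ram_gb >= target_gb:
            return folder
    raise ValueError(f"{mac_ram_gb} GB is below the smallest target Mac (16 GB)")


if __name__ == "__main__":
    for ram in (16, 32, 64):
        print(ram, "GB ->", pick_tier(ram))
```

Any tier whose peak RAM fits is a valid choice, so a 64 GB machine can equally run int8 or int4; the helper only encodes the table's "target Macs" column.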
In the int4 tier, the MoE experts (the bulk of the 8B parameters) carry the int4 weights,
specifically the gate/up projections; the router/gate, the lm_head, and the sensitive expert
down projection stay at int8 or bf16. This is the MoE quantization recipe that keeps the
model intact. All three tiers produce full, intelligible audio, so treat them as equal
options and pick by the RAM you have.
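The per-tier recipe can be written down as a small lookup, which makes the differences between tiers easy to compare. This is only a sketch: the bit widths are transcribed from the tier table, while the group labels (`"attention"`, `"experts.gate_up"`, and so on) are hypothetical names chosen for the example and are not the real parameter names in the checkpoint.

```python
# Bit widths transcribed from the tier table.
# Group labels are hypothetical, not the checkpoint's actual parameter names.
BF16 = 16

RECIPE = {
    "bf16": {},  # nothing quantized (reference tier)
    "int8": {
        "attention": 8, "ffn": 8, "lm_head": 8,
        "experts.gate_up": 8, "experts.down": 8,
    },
    "int4": {
        "attention": 8, "ffn": 8, "lm_head": 8,
        "experts.gate_up": 4,  # only the expert gate/up projections go to int4
        "experts.down": 8,     # the sensitive down projection stays int8
    },
}


def bits_for(tier: str, group: str) -> int:
    """Bits per weight for a group; anything not listed (router, embeddings, norms) stays bf16."""
    return RECIPE[tier].get(group, BF16)


if __name__ == "__main__":
    for group in ("attention", "experts.gate_up", "experts.down", "router"):
        print(group, {tier: bits_for(tier, group) for tier in RECIPE})
```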
Quick start
```bash
# 1. get the runtime
git clone https://github.com/sb1992/mlx-zonos2.git
cd mlx-zonos2
uv sync --extra oracle   # `oracle` extra = torchaudio, for enrolling a voice from raw audio

# 2. download one tier (self-contained: trunk + DAC + speaker encoder)
hf download shraey/zonos2-mlx --include "int8/*" --local-dir ./zonos2-mlx-weights

# 3. clone a voice + synthesize
python scripts/zonos2_cli.py \
  --model-dir ./zonos2-mlx-weights/int8 \
  --text "The quick brown fox jumps over the lazy dog." \
  --ref ref.wav \
  --out out.wav
```
Swap int8 → int4 (16 GB Macs) or bf16 (64 GB Macs): same flow, just point --model-dir
at the folder you downloaded. To grab every tier at once, drop the --include filter.
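The same download can be scripted from Python with `huggingface_hub.snapshot_download`, whose `allow_patterns` argument plays the role of the `--include` filter. This is a minimal sketch, assuming `huggingface_hub` is installed; `tier_patterns` and `download_tier` are hypothetical helper names, not part of the runtime.

```python
from typing import List, Optional

REPO_ID = "shraey/zonos2-mlx"
VALID_TIERS = ("bf16", "int8", "int4")


def tier_patterns(tier: Optional[str] = None) -> Optional[List[str]]:
    """Glob patterns matching one tier folder; None means no filter (every tier)."""
    if tier is None:
        return None
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier {tier!r}, expected one of {VALID_TIERS}")
    return [f"{tier}/*"]


def download_tier(tier: Optional[str] = None,
                  local_dir: str = "./zonos2-mlx-weights") -> str:
    """Download one tier (or all tiers when tier is None) and return the local path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=tier_patterns(tier),
        local_dir=local_dir,
    )
```

Calling `download_tier("int4")` would fetch only the int4 folder, after which `--model-dir` points at `./zonos2-mlx-weights/int4`.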
--ref enrolls a reference clip on the fly (needs the oracle extra for the mel front-end). You
can also enroll a voice once into a small .zonos profile and reuse it; generation is then
pure MLX with no torch. See the runtime repo for the
Python API, the enroll-once flow, and the full parity report.
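For batch jobs it can help to drive the CLI from a script. The sketch below only assembles the four flags shown in the quick start (`--model-dir`, `--text`, `--ref`, `--out`) and runs the command with `subprocess`; the helper names are hypothetical, and it assumes `python` resolves to the environment created by `uv sync` and that the working directory is the cloned mlx-zonos2 repo.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_cli_command(model_dir: PathLike, text: str,
                      ref: PathLike, out: PathLike) -> List[str]:
    """Assemble the quick-start CLI invocation as an argv list (no shell quoting needed)."""
    return [
        "python", "scripts/zonos2_cli.py",
        "--model-dir", str(model_dir),
        "--text", text,
        "--ref", str(ref),
        "--out", str(out),
    ]


def synthesize(model_dir: PathLike, text: str, ref: PathLike, out: PathLike,
               repo_dir: PathLike = ".") -> None:
    """Run the CLI from the runtime repo; raises CalledProcessError on a non-zero exit."""
    cmd = build_cli_command(model_dir, text, ref, out)
    subprocess.run(cmd, check=True, cwd=str(repo_dir))
```

Passing the arguments as a list avoids shell escaping problems when the text contains quotes or other special characters.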
Responsible use
This performs voice cloning: it can reproduce a person's voice from a few seconds of audio. Use it responsibly: no impersonation, fraud, or disinformation; only clone voices you own or have explicit consent for; disclose AI-generated audio wherever it's published. See the runtime repo for the full policy.
Attribution + license
This is a derivative port. The components it builds on are each independently licensed:
- ZONOS2: Apache-2.0, © Zyphra. The 8B-MoE model, the DAC 44.1 kHz codec, and the speaker encoder are Zyphra's.
- Released checkpoint: this port converts the drbaph/ZONOS2-BF16 release (its speaker encoder is an ECAPA-TDNN, 2048-d).
- Porting oracle: the clean plain-torch Zonos2_TTS-ComfyUI fork by Saganaki22 (Apache-2.0), used as the op-for-op reference.
- MLX: Apple's ml-explore/mlx.
The MLX port code is licensed Apache-2.0. You must comply with the upstream ZONOS2 license and usage terms for the model weights. Full credit to Zyphra for the model, its training, and the open release; this repo only re-expresses their runtime in MLX.