Confucius4-TTS-mlx

An MLX port of netease-youdao/Confucius4-TTS (multilingual, cross-lingual, zero-shot voice-cloning TTS) for Apple Silicon.

The official model is CUDA-only. This repo re-implements the heavy stages in MLX so they run on the Mac GPU (Metal). Each ported stage was validated numerically against the original PyTorch model.
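A stage-by-stage comparison of this kind can be sketched as follows. This is a minimal illustration, not the validation script used for this port: the function names and the exact metric definitions (token-level argmax agreement, relative L2 error, Pearson correlation) are assumptions chosen to match the kinds of figures reported in this README, and the arrays are toy data standing in for real PyTorch and MLX outputs.

```python
import numpy as np

def argmax_agreement(logits_a, logits_b):
    """Fraction of positions where both sets of logits pick the same token."""
    same = np.argmax(logits_a, axis=-1) == np.argmax(logits_b, axis=-1)
    return float(np.mean(same))

def relative_error(candidate, reference):
    """L2 norm of the difference divided by the L2 norm of the reference."""
    return float(np.linalg.norm(candidate - reference) / np.linalg.norm(reference))

def waveform_correlation(a, b):
    """Pearson correlation between two equal-length waveforms."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

# Toy data: "reference" plays the role of PyTorch, "candidate" the role of MLX.
rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(50, 100))
cand_logits = ref_logits + 1e-3 * rng.normal(size=ref_logits.shape)
ref_wave = np.sin(np.linspace(0.0, 20.0, 1000))
cand_wave = ref_wave + 1e-3 * rng.normal(size=ref_wave.shape)

print("argmax agreement:", argmax_agreement(ref_logits, cand_logits))
print("relative error:  ", relative_error(cand_wave, ref_wave))
print("correlation:     ", waveform_correlation(cand_wave, ref_wave))
```

In practice the inputs would be the same intermediate tensor dumped from both implementations and converted to NumPy arrays before comparison.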

What runs where

| Stage | Backend | Notes |
| --- | --- | --- |
| Frontend: w2v-bert feature extraction, CAMPPlus speaker encoder, mel | PyTorch (MPS) | Not ported; runs once per utterance |
| T2S (GPT-2 decode, KV-cached) | MLX | Argmax matches PyTorch 99.2% |
| S2A flow-matching (DiT + WaveNet, 25-step Euler + CFG) | MLX | Mel relative error 0.77% |
| BigVGAN vocoder | MLX | Waveform correlation 0.9998 |
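The "25-step Euler + CFG" sampler in the S2A row can be illustrated with a small NumPy sketch. It is not the code in s2a_mlx.py: `euler_cfg_sample`, `toy_velocity`, and the `cfg_scale` value are hypothetical, the guidance formula shown is one common convention among several, and a closed-form toy velocity field stands in for the DiT.

```python
import numpy as np

def euler_cfg_sample(velocity_fn, x0, cond, num_steps=25, cfg_scale=1.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    velocity_fn(x, t, cond) returns a velocity; cond=None requests the
    unconditional prediction used by classifier-free guidance (CFG).
    """
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t runs over 0, dt, ..., 1 - dt and never reaches 1
        v_cond = velocity_fn(x, t, cond)
        v_uncond = velocity_fn(x, t, None)
        # CFG: move the unconditional velocity toward the conditional one.
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + dt * v
    return x

def toy_velocity(x, t, cond):
    """Exact velocity of a straight path toward `cond` (toward 0 if None)."""
    target = np.zeros_like(x) if cond is None else cond
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=4)                 # starting point at t=0
target = np.array([1.0, -2.0, 0.5, 3.0])   # stands in for the conditioned mel
sample = euler_cfg_sample(toy_velocity, noise, target, num_steps=25, cfg_scale=1.0)
print(np.round(sample, 6))
```

With this toy field, a guidance scale of 1.0 lands on the conditional target and larger scales overshoot it, which is the qualitative effect guidance has on a real model. Each Euler step costs two network evaluations here (conditional and unconditional).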

Benchmark (Apple M5, 24 GB)

3.8 s of audio generated end-to-end in **8.7 s** (frontend 0.6 s, T2S 4.1 s, S2A 1.4 s, vocoder 2.6 s), versus about 28.6 s for the original PyTorch pipeline on CPU.
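The arithmetic behind these numbers is easy to reproduce. The snippet below only re-derives quantities from the timings quoted above (real-time factor, speedup, and per-stage share); it measures nothing itself, and the variable names are illustrative.

```python
# Timings copied from the benchmark above, in seconds.
audio_s = 3.8
stages = {"frontend": 0.6, "t2s": 4.1, "s2a": 1.4, "vocoder": 2.6}
baseline_cpu_s = 28.6  # approximate figure for the PyTorch CPU pipeline

total_s = sum(stages.values())
rtf = total_s / audio_s             # seconds of compute per second of audio
speedup = baseline_cpu_s / total_s  # relative to the PyTorch CPU pipeline

print(f"total: {total_s:.1f} s")
print(f"real-time factor: {rtf:.2f}")
print(f"speedup vs PyTorch CPU: {speedup:.1f}x")
for name, secs in stages.items():
    print(f"{name}: {secs / total_s:.0%} of wall time")
```

On these figures, generation runs at roughly 2.3x slower than real time and about 3.3x faster than the CPU baseline, with T2S decoding taking close to half of the total.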

confucius_mlx/ — MLX implementations (t2s_mlx.py, s2a_mlx.py, vocoder_mlx.py)
weights/t2s_model.safetensors — T2S weights (F32, loads directly with mx.load)
weights/s2a_mlx.safetensors — S2A weights (weight-norm folded)
weights/bigvgan_mlx.safetensors — BigVGAN vocoder weights (weight-norm folded, from NVIDIA BigVGAN v2)
checkpoints/ — tokenizer + w2v-bert normalization stats
inference_config.yaml, scripts/convert_bigvgan.py, infer_mlx.py
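Since the T2S weights load directly with mx.load, inspecting a weight file takes only a few lines. The sketch below assumes MLX is installed on an Apple Silicon Mac; `load_weights`, `summarize`, and `main` are hypothetical helper names and are not part of this repo. The MLX import is deferred so that the summary helper can be used without MLX.

```python
import math

def load_weights(path):
    """Load a .safetensors file into a dict mapping names to mx.array."""
    import mlx.core as mx  # deferred import: requires the mlx package
    return mx.load(path)

def summarize(weights):
    """Return (number of tensors, total parameter count) for a weight dict.

    Works with any values that expose a `.shape` tuple.
    """
    total = sum(math.prod(w.shape) for w in weights.values())
    return len(weights), total

def main():
    # Call this on a Mac with mlx installed, from the repo root.
    weights = load_weights("weights/t2s_model.safetensors")
    count, params = summarize(weights)
    print(f"{count} tensors, {params / 1e6:.1f}M parameters")
```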

scripts/convert_bigvgan.py is included for reproducibility: it re-fetches the original NVIDIA checkpoint and folds the weight norm into plain weights. The converted weights already ship in weights/, so you don't need to run it.
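For readers unfamiliar with weight-norm folding, here is a minimal NumPy sketch of the idea. It is not the contents of scripts/convert_bigvgan.py. It assumes the `weight_g` / `weight_v` parameter naming of PyTorch's `torch.nn.utils.weight_norm` with the norm taken per output channel, and the helper names and toy state dict are hypothetical.

```python
import numpy as np

def fold_weight_norm(weight_g, weight_v):
    """Return w = g * v / ||v||, with the norm taken per output channel."""
    axes = tuple(range(1, weight_v.ndim))  # every axis except the output axis
    norm = np.sqrt(np.sum(weight_v ** 2, axis=axes, keepdims=True))
    return weight_g * weight_v / norm

def fold_state_dict(state):
    """Replace each `<name>.weight_g` / `<name>.weight_v` pair with `<name>.weight`."""
    folded = {}
    for key, value in state.items():
        if key.endswith(".weight_g"):
            prefix = key[: -len(".weight_g")]
            folded[prefix + ".weight"] = fold_weight_norm(value, state[prefix + ".weight_v"])
        elif not key.endswith(".weight_v"):
            folded[key] = value  # biases and other tensors pass through
    return folded

# Toy Conv1d-shaped parameters: (out_channels, in_channels, kernel_size).
rng = np.random.default_rng(0)
state = {
    "conv.weight_g": rng.uniform(0.5, 2.0, size=(8, 1, 1)),
    "conv.weight_v": rng.normal(size=(8, 4, 3)),
    "conv.bias": np.zeros(8),
}
folded = fold_state_dict(state)
print(sorted(folded))
```

Folding removes the per-layer normalization from the inference graph, so the MLX model can use ordinary convolution weights. A real conversion would likely also need to rearrange convolution weights into the layout MLX expects; that step is omitted here.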

Usage

The frontend still depends on the original repo. Clone it alongside this one, install the dependencies, and run inference:

git clone https://github.com/netease-youdao/Confucius4-TTS.git
pip install mlx torch torchaudio transformers==4.52.4 sentencepiece soundfile librosa pyyaml
python infer_mlx.py --ref voice.wav --text "Xin chào" --lang vi --out out.wav
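To synthesize several sentences with the same reference voice, the command above can be wrapped in a small script. This is a sketch under assumptions: it uses only the `--ref`, `--text`, `--lang`, and `--out` flags shown above, the helper names and output file pattern are made up for illustration, and each call starts a fresh process, so the models are reloaded every time.

```python
import subprocess
import sys

def build_command(ref, text, lang, out, script="infer_mlx.py"):
    """Build the argv list for one infer_mlx.py invocation."""
    return [sys.executable, script,
            "--ref", ref, "--text", text, "--lang", lang, "--out", out]

def synthesize_all(ref, lines, lang, run=subprocess.run):
    """Run one synthesis per line of text and return the output paths.

    `run` is injectable so the function can be dry-run without the model.
    """
    outputs = []
    for i, text in enumerate(lines):
        out = f"out_{i:03d}.wav"  # hypothetical naming scheme
        run(build_command(ref, text, lang, out), check=True)
        outputs.append(out)
    return outputs
```

For example, `synthesize_all("voice.wav", ["Xin chào"], "vi")` would write out_000.wav, assuming it is run from the directory that contains infer_mlx.py.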

Status / limitations

This is a research work-in-progress. The frontend (w2v-bert conformer) is intentionally left on PyTorch/MPS. The numbers above were measured on a single utterance on an Apple M5, so treat them as indicative.

Attribution & license

Model & architecture: netease-youdao/Confucius4-TTS, Apache-2.0.
Vocoder: NVIDIA BigVGAN v2 (the original checkpoint is fetched separately by scripts/convert_bigvgan.py; the converted weights ship in weights/).
This MLX port is released under Apache-2.0.
