How to use beyoru/Confucius4-TTS-mlx with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir Confucius4-TTS-mlx beyoru/Confucius4-TTS-mlx
Confucius4-TTS-mlx
An MLX port of netease-youdao/Confucius4-TTS (multilingual, cross-lingual, zero-shot voice-cloning TTS) for Apple Silicon.
The official model is CUDA-only. This repo re-implements the heavy parts in MLX so they run on the Mac GPU (Metal). Each ported stage was validated numerically against the original PyTorch model.
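Stage-by-stage validation of this kind is usually reported with a few simple agreement metrics. The sketch below shows one plausible way to compute an argmax match rate, a relative error, and a waveform correlation with NumPy. The exact definitions used for this port are not stated here, so the function names and formulas are assumptions for illustration only.

```python
import numpy as np

def argmax_match_rate(ref_logits, test_logits):
    """Fraction of positions where both models pick the same token id."""
    same = ref_logits.argmax(axis=-1) == test_logits.argmax(axis=-1)
    return float(np.mean(same))

def relative_error(ref, test):
    """L2 norm of the difference divided by the L2 norm of the reference (assumed definition)."""
    return float(np.linalg.norm(test - ref) / np.linalg.norm(ref))

def waveform_correlation(ref, test):
    """Pearson correlation between two equal-length waveforms."""
    return float(np.corrcoef(ref.ravel(), test.ravel())[0, 1])

# Synthetic stand-ins for "PyTorch output" and "MLX output".
rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(50, 32))
test_logits = ref_logits + 1e-3 * rng.normal(size=(50, 32))
ref_wave = np.sin(np.linspace(0.0, 20.0, 1000))
test_wave = ref_wave + 1e-3 * rng.normal(size=1000)

print("argmax match:", argmax_match_rate(ref_logits, test_logits))
print("relative error:", relative_error(ref_logits, test_logits))
print("waveform correlation:", waveform_correlation(ref_wave, test_wave))
```

In practice the two inputs would be the same intermediate tensor dumped from the PyTorch model and from the MLX port, converted to NumPy arrays.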
What runs where
| Stage | Backend | Notes |
|---|---|---|
| Frontend: w2v-bert feature extraction, CAMPPlus speaker encoder, mel | PyTorch (MPS) | not ported; runs once per utterance |
| T2S (GPT-2 decode, KV-cached) | MLX | argmax agreement with PyTorch: 99.2% |
| S2A flow-matching (DiT + WaveNet, 25-step Euler + CFG) | MLX | mel relative error: 0.77% |
| BigVGAN vocoder | MLX | waveform correlation: 0.9998 |
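The S2A row mentions a 25-step Euler solver with classifier-free guidance (CFG). As a rough illustration of that sampling loop, here is a minimal NumPy sketch. `velocity_fn`, the `cfg_scale` default, and the guidance formula (one common convention) are hypothetical and may differ from what `s2a_mlx.py` actually does.

```python
import numpy as np

def euler_cfg_sample(velocity_fn, x0, cond, num_steps=25, cfg_scale=2.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    velocity_fn(x, t, cond) is a hypothetical model call; cond=None selects
    the unconditional branch used for classifier-free guidance.
    """
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_cond = velocity_fn(x, t, cond)
        v_uncond = velocity_fn(x, t, None)
        # Guided velocity: extrapolate from the unconditional prediction
        # toward the conditional one (assumed convention).
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + dt * v
    return x

def toy_velocity(x, t, cond):
    """Toy stand-in for the network: constant velocity equal to cond."""
    return np.zeros_like(x) if cond is None else cond

x0 = np.zeros(3)
cond = np.array([1.0, -2.0, 0.5])
x1 = euler_cfg_sample(toy_velocity, x0, cond, num_steps=25, cfg_scale=1.0)
print(x1)
```

With a constant velocity field the Euler steps are exact, so the toy run ends at `x0 + cond` when `cfg_scale` is 1. Each step calls the model twice, which is why CFG roughly doubles the cost of this stage.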
Benchmark (Apple M5, 24 GB)
3.8 s of audio is generated end-to-end in **8.7 s** (frontend 0.6 s / T2S 4.1 s / S2A 1.4 s / vocoder 2.6 s),
versus ~28.6 s for the original PyTorch pipeline on CPU.
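To put these timings in perspective, the short calculation below derives the real-time factor, the speedup over the CPU pipeline, and each stage's share of the total. It uses only the numbers reported above; the variable names are illustrative.

```python
# Timings reported in the benchmark above, in seconds.
audio_s = 3.8
stage_s = {"frontend": 0.6, "T2S": 4.1, "S2A": 1.4, "vocoder": 2.6}
cpu_total_s = 28.6

mlx_total_s = sum(stage_s.values())
rtf = mlx_total_s / audio_s          # > 1 means slower than real time
speedup = cpu_total_s / mlx_total_s  # MLX pipeline vs. PyTorch on CPU

print(f"total: {mlx_total_s:.1f} s")
print(f"real-time factor: {rtf:.2f}")
print(f"speedup vs CPU: {speedup:.1f}x")
for name, seconds in stage_s.items():
    print(f"{name}: {100 * seconds / mlx_total_s:.0f}% of total")
```

This works out to a real-time factor of about 2.3 (generation takes longer than playback) and a speedup of about 3.3x over CPU, with T2S as the largest stage at roughly 47% of the total.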
Contents
- confucius_mlx/: MLX implementations (t2s_mlx.py, s2a_mlx.py, vocoder_mlx.py)
- weights/t2s_model.safetensors: T2S weights (F32, loads directly with mx.load)
- weights/s2a_mlx.safetensors: S2A weights (weight-norm folded)
- weights/bigvgan_mlx.safetensors: BigVGAN vocoder weights (weight-norm folded, from NVIDIA BigVGAN v2)
- checkpoints/: tokenizer + w2v-bert normalization stats
- inference_config.yaml, scripts/convert_bigvgan.py, infer_mlx.py
scripts/convert_bigvgan.py is included for reproducibility (re-fetch + fold from the
original NVIDIA checkpoint), but the converted weights ship in weights/ so you don't need to run it.
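For readers unfamiliar with "folding" weight normalization, the sketch below shows the general idea: a weight-normalized layer stores a direction tensor `v` and a per-channel gain `g`, and folding replaces them with the single effective weight `w = g * v / ||v||`. It assumes PyTorch's weight-norm parametrization with the norm taken over every axis except the output-channel axis. The names `weight_g` and `weight_v` follow PyTorch's older `weight_norm` API. Whether `scripts/convert_bigvgan.py` does exactly this, or also transposes weights to MLX's layout, is not shown here.

```python
import numpy as np

def fold_weight_norm(weight_g, weight_v):
    """Return w = g * v / ||v||, with the norm computed per output channel (axis 0)."""
    reduce_axes = tuple(range(1, weight_v.ndim))
    norm = np.sqrt(np.sum(weight_v ** 2, axis=reduce_axes, keepdims=True))
    return weight_g * weight_v / norm

# Example with a Conv1d-shaped weight: (out_channels, in_channels, kernel_size).
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3, 5))
g = rng.uniform(0.5, 2.0, size=(4, 1, 1))
w = fold_weight_norm(g, v)

# After folding, each output channel of w has L2 norm equal to its gain g.
print(np.sqrt((w ** 2).sum(axis=(1, 2))))
print(g.ravel())
```

Folding once at conversion time means the inference code only needs a plain convolution weight and does not have to recompute the normalization on every forward pass.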
Usage
The frontend still relies on the original repo. Clone it alongside this one and install the dependencies:
git clone https://github.com/netease-youdao/Confucius4-TTS.git
pip install mlx torch torchaudio transformers==4.52.4 sentencepiece soundfile librosa pyyaml
python infer_mlx.py --ref voice.wav --text "Xin chào" --lang vi --out out.wav
Status / limitations
This is a research work in progress. The frontend (w2v-bert conformer) is intentionally left on PyTorch/MPS. The benchmark numbers above come from a single utterance on an M5.
Attribution & license
- Model & architecture: netease-youdao/Confucius4-TTS, Apache-2.0.
- Vocoder: NVIDIA BigVGAN v2 (fetched separately, see above).
- This MLX port is released under Apache-2.0.