Confucius4-TTS-mlx (int8)

8-bit quantized MLX build of netease-youdao/Confucius4-TTS (multilingual + cross-lingual zero-shot voice cloning, 14 languages: zh, en, ja, ko, de, fr, es, id, it, th, pt, ru, ms, vi) for Apple Silicon.

The T2S body matmuls and the w2v-bert encoder linear layers are quantized to 8 bits (group size 64). The semantic_head, norms, and embeddings are kept in fp32, because quantizing the head to 8 bits audibly degrades pronunciation. The S2A flow and the BigVGAN vocoder also stay in fp32. Total size: ~2.6 GB.
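To make "8-bit, group size 64" concrete, here is a minimal pure-Python sketch of generic group-wise affine quantization: each run of 64 weights gets its own scale and bias, and every weight is stored as an integer code in 0..255. This illustrates the general technique only; it is not the kernel MLX uses, and the helper names (`quantize_group`, `dequantize_group`, `quantize_row`) are hypothetical.

```python
import math

def quantize_group(values, bits=8):
    """Affine-quantize one group of floats to unsigned integer codes."""
    levels = (1 << bits) - 1                 # 255 for 8-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo                  # lo acts as the per-group bias

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from codes: w ~ code * scale + bias."""
    return [c * scale + bias for c in codes]

def quantize_row(row, group_size=64, bits=8):
    """Split a weight row into groups and quantize each group independently."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

# Toy weight row (illustrative data, not taken from the model)
row = [math.sin(0.37 * i) for i in range(128)]
groups = quantize_row(row)                   # two groups of 64 codes each
```

Because each group has its own scale, the rounding error of any weight is bounded by half that group's scale, which is why smaller groups generally track the original weights more closely at the cost of more stored scales and biases.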

T2S: ~2.64 GB (fp32) -> ~1.2 GB (int8)
w2v-bert: ~1.5 GB (fp32) -> ~0.6 GB (int8)
Speed (Apple M5): real-time factor (RTF) ~1.7 (vs ~2.4 for fp32)
Quality: matches fp32 in listening tests
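As a back-of-envelope check on the sizes above, the sketch below estimates how large a component would be if every one of its weights were quantized. It assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the 8-bit codes; that storage layout is an assumption for illustration and is not stated in this card.

```python
def quantized_bits_per_weight(bits=8, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight, including per-group scale and bias overhead.

    scale_bits and bias_bits are assumed values, not confirmed by the model card.
    """
    return bits + (scale_bits + bias_bits) / group_size

def fully_quantized_size_gb(fp32_size_gb, **kwargs):
    """Size of an fp32 component if all of its weights were quantized."""
    return fp32_size_gb * quantized_bits_per_weight(**kwargs) / 32

bpw = quantized_bits_per_weight()            # 8 + 32/64 bits per weight
t2s_lower_bound = fully_quantized_size_gb(2.64)
w2v_lower_bound = fully_quantized_size_gb(1.5)
```

Under these assumptions the fully quantized sizes come out to roughly 0.70 GB for T2S and 0.40 GB for w2v-bert. The reported ~1.2 GB and ~0.6 GB are larger, which is consistent with part of each component (the head, norms, and embeddings) remaining in fp32.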

Usage

Requires a build of mlx-audio that includes the confucius4 model (PR #799):

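from mlx_audio.tts.utils import load

# Load the 8-bit quantized model by its repo id
model = load("beyoru/Confucius4-TTS-mlx-int8")

# ref_audio is the voice to clone; lang is the language code of the input text
for r in model.generate("Xin chào", ref_audio="voice.wav", lang="vi"):
    ...  # r.audio holds the generated samples at 22050 Hz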
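To keep the generated audio, the samples can be written to a WAV file. The sketch below uses only the Python standard library and assumes the samples have already been converted to a flat sequence of floats in [-1, 1]; how `r.audio` is converted to such a sequence depends on its array type and is not shown here. The helper name `write_wav` is hypothetical, and the sine tone stands in for real model output.

```python
import math
import struct
import wave

def write_wav(path, samples, sample_rate=22050):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))          # clip to avoid int16 overflow
        pcm += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)                         # mono
        wf.setsampwidth(2)                         # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(pcm))

# Stand-in for model output: 0.1 s of a 440 Hz tone at 22050 Hz
demo_samples = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
```

Calling `write_wav("out.wav", demo_samples)` produces a mono file at the 22050 Hz rate mentioned above. If a generation yields several results, their samples would need to be concatenated before writing.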

Attribution & license

Model & architecture: netease-youdao/Confucius4-TTS (Apache-2.0)
Vocoder: NVIDIA BigVGAN v2; speaker encoder: 3D-Speaker CAMPPlus (funasr)
MLX port by Hert4, released under Apache-2.0.
