Confucius4-TTS-mlx (int8)

8-bit quantized MLX build of netease-youdao/Confucius4-TTS (multilingual + cross-lingual zero-shot voice cloning, 14 languages: zh, en, ja, ko, de, fr, es, id, it, th, pt, ru, ms, vi) for Apple Silicon.

Weights are quantized to 8-bit (group size 64) in the T2S body matmuls and the w2v-bert encoder linear layers. The semantic_head, norms, and embeddings are kept in fp32, because quantizing the head to 8-bit audibly degrades pronunciation. The S2A flow and the BigVGAN vocoder also stay in fp32. Total size is ~2.6 GB.
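For readers unfamiliar with group-wise quantization, here is a minimal pure-Python sketch of the general idea: each run of 64 weights gets its own scale and offset, and every weight is stored as an 8-bit code. It illustrates the technique only and is not the code used to build this checkpoint; the function names are hypothetical, and MLX's own quantization routines may differ in details such as rounding and how scales are stored.

```python
# Illustrative group-wise affine quantization (not the MLX implementation).
# All names here are hypothetical and exist only for this sketch.

def quantize_group(values, bits=8):
    """Map one group of floats to unsigned integer codes plus (scale, offset)."""
    levels = (1 << bits) - 1            # 255 code steps for 8-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from codes and the group's parameters."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=64, bits=8):
    """Split a weight row into fixed-size groups and quantize each separately."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]
```

Because each group carries its own scale, the worst-case reconstruction error in a group is about half of that group's scale, so a few large weights only coarsen the 64 values around them rather than the whole matrix.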

  • T2S: ~2.64 GB (fp32) -> ~1.2 GB
  • w2v-bert: ~1.5 GB (fp32) -> ~0.6 GB
  • Speed (Apple M5): RTF ~1.7 (vs ~2.4 fp32)
  • Quality: matched to fp32 in listening tests
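The card does not define RTF. The sketch below assumes the conventional definition, synthesis time divided by the duration of the generated audio, which is consistent with the lower int8 figure being the faster one. The helper name is hypothetical; the 22050 Hz rate is the output rate stated in this card.

```python
SAMPLE_RATE = 22050  # output sample rate stated in this card


def real_time_factor(synthesis_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """RTF under the assumed definition: wall-clock time / audio duration."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds
```

Under this definition, an RTF of 1.7 means that 10 seconds of audio take about 17 seconds to synthesize.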

Usage

Requires an mlx-audio build that includes the confucius4 model (PR #799):

from mlx_audio.tts.utils import load
model = load("beyoru/Confucius4-TTS-mlx-int8")
for r in model.generate("Xin chào", ref_audio="voice.wav", lang="vi"):
    ...  # r.audio at 22050 Hz
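To keep the generated audio, the samples can be written to a WAV file. The following is a minimal sketch using only the Python standard library. It assumes the samples are mono floats in [-1, 1] and can be iterated as plain Python numbers (an array type may first need a conversion such as `.tolist()`); `save_wav` is a hypothetical helper, not part of mlx-audio.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=22050):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))           # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono (assumption)
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)  # 22050 Hz, as stated in this card
        wf.writeframes(bytes(frames))
```

If `generate` yields several results, their samples would need to be concatenated before a single call to `save_wav`.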

Attribution & license

  • Model & architecture: netease-youdao/Confucius4-TTS (Apache-2.0)
  • Vocoder: NVIDIA BigVGAN v2; speaker encoder: 3D-Speaker CAMPPlus (funasr)
  • MLX port by Hert4, released under Apache-2.0.