CosyVoice3-0.5B MLX bf16

CosyVoice 3 text-to-speech model converted to MLX safetensors format with unquantized bf16 weights for Apple Silicon inference. Includes the S3-Tokenizer-v3 reference-audio encoder needed for zero-shot voice cloning.

Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.

Swift inference: speech-swift

Variants

| Variant | LLM | DiT | Total | Use case |
|---|---|---|---|---|
| This bundle (bf16) | bf16 | bf16 | ~2.1 GB | Reference quality, no quantization noise anywhere |
| 8-bit-full | int8 (group_size=64) | int8 (group_size=64) | ~1.6 GB | Best quality/size trade-off |
| 8-bit | int8 (group_size=64) | int4 | ~1.4 GB | Cleaner LLM logits, light DiT |
| 4-bit | int4 (group_size=64) | int4 | ~1.2 GB | Smallest download / disk footprint |

All bundles include the speech tokenizer and support zero-shot voice cloning. Choose bf16 when quantization noise in the LLM or DiT is a problem (for example in long-form synthesis, low-resource languages, or when voice-cloning fidelity matters) and disk space and RAM are not constraints.
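The choice between bundles can be sketched as a small lookup. This is only an illustration: the `Variant` enum and `pickVariant` function are hypothetical names, the sizes are the approximate totals listed above, and using size as a stand-in for precision is an assumption. It considers download size only, not runtime memory.

```swift
/// The four bundles listed above, ordered from largest to smallest.
/// Hypothetical type for illustration; not part of any library.
enum Variant: String, CaseIterable {
    case bf16 = "bf16"
    case int8Full = "8-bit-full"
    case int8 = "8-bit"
    case int4 = "4-bit"

    /// Approximate total bundle size in GB, as listed above.
    var approxSizeGB: Double {
        switch self {
        case .bf16: return 2.1
        case .int8Full: return 1.6
        case .int8: return 1.4
        case .int4: return 1.2
        }
    }
}

/// Returns the largest variant that fits the budget, or nil if none fits.
func pickVariant(budgetGB: Double) -> Variant? {
    Variant.allCases.first { $0.approxSizeGB <= budgetGB }
}

// Example: a 1.5 GB budget rules out bf16 and 8-bit-full.
if let choice = pickVariant(budgetGB: 1.5) {
    print(choice.rawValue)
}
```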

Model Details

| Component | Architecture | Size |
|---|---|---|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 965 MB (bf16) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (bf16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp32) |
| S3-Tokenizer-v3 | 12-layer Conformer + FSMN + FSQ (242M params) | 462 MB (bf16) |
| Total | | ~2.1 GB |
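As a quick sanity check, the component sizes can be summed to confirm the stated total. The snippet assumes 1 GB = 1024 MB, which the card does not specify; with 1000 MB per GB the rounded result is the same.

```swift
// Component sizes in MB, copied from the list above.
let componentSizesMB: [(name: String, sizeMB: Int)] = [
    ("LLM", 965),
    ("DiT Flow Matching", 634),
    ("HiFi-GAN Vocoder", 79),
    ("S3-Tokenizer-v3", 462),
]

// Sum the per-component sizes.
let totalMB = componentSizesMB.reduce(0) { $0 + $1.sizeMB }

// Convert to GB (assumption: 1 GB = 1024 MB) and round to one decimal place.
let totalGB = (Double(totalMB) / 1024 * 10).rounded() / 10

print("Total: \(totalMB) MB ≈ \(totalGB) GB")  // prints "Total: 2140 MB ≈ 2.1 GB"
```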

Pipeline

Text          ─┐
                ├─► LLM (Qwen2.5-0.5B bf16)  ─► Speech tokens (FSQ 6561)
Ref transcript ┘                                           │
                                                           ▼
                              ┌─► prompt_token ─┐
Reference WAV ─► S3-Tokenizer-v3                ├─► DiT Flow Matching ─► Mel
              ─► Matcha mel    ─► prompt_feat ─┘    (cond + spk_emb, bf16)    │
              ─► CAM++         ─► flow_embedding                              ▼
                                                                          HiFi-GAN
                                                                              │
                                                                              ▼
                                                                         Audio (24 kHz)
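The speech-token vocabulary size in the diagram, 6561, equals 3^8. That is consistent with a finite scalar quantizer that has 8 dimensions and 3 levels per dimension, although the card does not spell this out. The sketch below treats that factorization as an assumption and shows how a token ID would map to per-dimension levels; the digit order and function names are illustrative only.

```swift
// Assumption: the 6561-entry codebook factors as 3^8, i.e. 8 FSQ dimensions
// with 3 levels each. The digit order (least significant first) is illustrative.
let fsqLevels = 3
let fsqDims = 8
let fsqCodebookSize = 6561  // 3^8

/// Splits a token ID in 0..<6561 into 8 base-3 digits, least significant first.
func fsqDigits(of token: Int) -> [Int] {
    precondition(token >= 0 && token < fsqCodebookSize, "token out of FSQ range")
    var remaining = token
    var digits: [Int] = []
    for _ in 0..<fsqDims {
        digits.append(remaining % fsqLevels)
        remaining /= fsqLevels
    }
    return digits
}

/// Recombines base-3 digits (least significant first) into the token ID.
func fsqToken(from digits: [Int]) -> Int {
    digits.reversed().reduce(0) { $0 * fsqLevels + $1 }
}

print(fsqDigits(of: 5))  // prints "[2, 1, 0, 0, 0, 0, 0, 0]"
```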

Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian.

Files

  • llm.safetensors — LLM weights (bf16, unquantised)
  • flow.safetensors — DiT flow matching decoder (bf16, unquantised)
  • hifigan.safetensors — HiFi-GAN vocoder (fp32, weight-norm folded)
  • speech_tokenizer.safetensors — S3-Tokenizer-v3 reference encoder (bf16)
  • config.json — Model configuration (tokenizer + frame rates)
  • vocab.json / merges.txt / tokenizer_config.json — Qwen2.5 BPE tokenizer

Conversion Details

  • LLM: bf16 throughout (no group quantisation applied)
  • Flow / DiT: bf16 throughout (no group quantisation applied)
  • HiFi-GAN: fp32 with weight normalization folded (w = g * v / ||v||)
  • Speech tokenizer: bf16 (it runs once per voice profile, so accuracy outweighs disk size)
  • Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
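The two tensor transformations above, weight-norm folding and the Conv1d layout change, can be illustrated on plain nested arrays. This is a minimal sketch, not the actual conversion script: `Tensor3`, `foldWeightNorm`, and `transposeToMLX` are hypothetical names, and the norm is taken per output channel, which matches PyTorch's default `weight_norm` behaviour (dim=0). A real converter would operate on tensors rather than nested arrays.

```swift
// A 3-D weight as nested arrays in PyTorch Conv1d layout: weight[out][in][kernel].
typealias Tensor3 = [[[Float]]]

/// Folds weight normalization: w = g * v / ||v||, with the L2 norm computed
/// per output channel. Assumes no channel of v is all zeros.
func foldWeightNorm(g: [Float], v: Tensor3) -> Tensor3 {
    precondition(g.count == v.count, "one gain per output channel")
    var folded: Tensor3 = []
    for (gain, channel) in zip(g, v) {
        // Sum of squares over every element of this output channel.
        let sumOfSquares = channel.joined().reduce(Float(0)) { $0 + $1 * $1 }
        let scale = gain / sumOfSquares.squareRoot()
        folded.append(channel.map { row in row.map { $0 * scale } })
    }
    return folded
}

/// Transposes PyTorch layout [out, in, kernel] to MLX layout [out, kernel, in].
func transposeToMLX(_ weight: Tensor3) -> Tensor3 {
    weight.map { (channel: [[Float]]) -> [[Float]] in
        // channel is indexed [in][kernel]; rebuild it as [kernel][in].
        let kernelSize = channel.first?.count ?? 0
        return (0..<kernelSize).map { k in channel.map { $0[k] } }
    }
}

// One output channel, one input channel, kernel size 2: ||v|| = 5, g = 10.
let w = foldWeightNorm(g: [10], v: [[[3, 4]]])
print(w)                  // prints "[[[6.0, 8.0]]]"
print(transposeToMLX(w))  // prints "[[[6.0], [8.0]]]"
```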

Zero-Shot Voice Cloning

For best clone quality, the LLM needs both the reference's acoustic prefix and its text transcript. Upstream's inference_zero_shot feeds the LLM concat(prompt_text, content_text) plus the reference's FSQ codes as an autoregressive prefix; this bundle ships everything needed for that path.

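import Foundation
import CosyVoiceTTS

// Load the bf16 bundle by its model ID.
let model = try await CosyVoiceTTSModel.fromPretrained(
    modelId: "aufklarer/CosyVoice3-0.5B-MLX-bf16"
)

// Placeholder path: point this at your own reference recording.
let refURL = URL(fileURLWithPath: "reference.wav")

// Pass the reference transcript along with the audio for best clone quality.
let result = try await model.synthesize(
    text: "你好,欢迎来到 CosyVoice 三。",
    referenceWAV: refURL,
    referenceTranscript: "床前明月光,疑是地上霜。"
)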
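To make the zero-shot input layout concrete, here is a small sketch of how the pieces described above fit together. `ZeroShotLLMInput` and `makeZeroShotInput` are hypothetical names and are not part of CosyVoiceTTS or upstream. String concatenation stands in for concatenating tokenized text, and the reference codes are placeholders rather than real tokenizer output.

```swift
/// Hypothetical container for what the LLM receives in zero-shot mode.
struct ZeroShotLLMInput {
    /// Reference transcript followed by the text to synthesize.
    let text: String
    /// FSQ codes of the reference audio, used as the autoregressive prefix.
    let speechPrefix: [Int]
}

/// Builds the input as concat(prompt_text, content_text) plus the reference's FSQ codes.
func makeZeroShotInput(
    promptText: String,
    contentText: String,
    referenceCodes: [Int]
) -> ZeroShotLLMInput {
    // Every code must be a valid index into the 6561-entry FSQ codebook.
    precondition(
        referenceCodes.allSatisfy { (0..<6561).contains($0) },
        "FSQ code out of range"
    )
    return ZeroShotLLMInput(text: promptText + contentText, speechPrefix: referenceCodes)
}

let input = makeZeroShotInput(
    promptText: "床前明月光,疑是地上霜。",
    contentText: "你好,欢迎来到 CosyVoice 三。",
    referenceCodes: [12, 4075, 6560]  // placeholder codes, not real tokenizer output
)
print(input.text)
```

Omitting the transcript would leave the LLM with an acoustic prefix it has no matching text for, which is why the card recommends supplying both.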

Source

Upstream: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
Paper: CosyVoice 3 (arXiv:2505.17589)

License

Apache 2.0 (inherited from upstream).
