To download aufklarer/CosyVoice3-0.5B-MLX-bf16 for use with MLX:

    # Download the model from the Hub (the extras are quoted so zsh does not expand the brackets)
    pip install "huggingface_hub[hf_xet]"
    huggingface-cli download --local-dir CosyVoice3-0.5B-MLX-bf16 aufklarer/CosyVoice3-0.5B-MLX-bf16

CosyVoice3-0.5B MLX bf16
CosyVoice 3 text-to-speech model converted to MLX safetensors format with unquantized bf16 weights for Apple Silicon inference. Includes the S3-Tokenizer-v3 reference-audio encoder needed for zero-shot voice cloning.
Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.
Swift inference: speech-swift
Variants
| Variant | LLM | DiT | Total | Use case |
|---|---|---|---|---|
| This bundle (bf16) | bf16 | bf16 | ~2.1 GB | Reference quality — no quantization noise anywhere |
| 8-bit-full | int8 (group_size=64) | int8 (group_size=64) | ~1.6 GB | Best quality/size trade-off |
| 8-bit | int8 (group_size=64) | int4 | ~1.4 GB | Cleaner LLM logits, light DiT |
| 4-bit | int4 (group_size=64) | int4 | ~1.2 GB | Smallest download / disk footprint |
All bundles include the speech tokenizer and support zero-shot voice cloning. Choose bf16 when LLM/DiT quantisation noise is a problem (long-form synthesis, low-resource languages, voice cloning fidelity) and disk/RAM are not a concern.
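
The trade-off in the table can be turned into a small selection helper. This is a minimal sketch, not part of the bundle: the sizes are the approximate totals from the table, the names are the table's variant labels (not necessarily repository names), and download size is only a rough proxy for runtime memory.

```python
# Approximate bundle sizes in GB, copied from the Variants table,
# ordered from least to most quantised.
VARIANTS = [
    ("bf16", 2.1),
    ("8-bit-full", 1.6),
    ("8-bit", 1.4),
    ("4-bit", 1.2),
]


def pick_variant(budget_gb):
    """Return the least-quantised variant that fits the size budget, or None."""
    for name, size_gb in VARIANTS:
        if size_gb <= budget_gb:
            return name
    return None


print(pick_variant(2.0))
```

With a 2.0 GB budget the helper skips bf16 and returns the next label in the table; with a budget below the smallest bundle it returns `None`.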
Model Details
| Component | Architecture | Size |
|---|---|---|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 965 MB (bf16) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (bf16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp32) |
| S3-Tokenizer-v3 | 12-layer Conformer + FSMN + FSQ (242M params) | 462 MB (bf16) |
| Total | | ~2.1 GB |
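
The total can be checked from the per-component sizes. The sketch below sums the table and also estimates a parameter count from a bf16 file size (2 bytes per weight). It assumes that "MB" in the table means MiB, which the card does not state.

```python
# Component sizes in MB, copied from the Model Details table.
SIZES_MB = {
    "LLM (bf16)": 965,
    "DiT flow matching (bf16)": 634,
    "HiFi-GAN vocoder (fp32)": 79,
    "S3-Tokenizer-v3 (bf16)": 462,
}

total_mb = sum(SIZES_MB.values())
total_gb = total_mb / 1024  # assumption: the table's "MB" are MiB


def approx_params_millions(size_mb, bytes_per_param=2):
    """Rough parameter count (in millions) implied by a weight file size."""
    return size_mb * 1024 * 1024 / bytes_per_param / 1e6


print(total_mb, round(total_gb, 2))
print(round(approx_params_millions(462)))
```

Under the MiB assumption, the 462 MB tokenizer file works out to roughly the 242M parameters listed for S3-Tokenizer-v3, and the four components add up to the ~2.1 GB total.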
Pipeline

    Text          ─┐
                   ├─► LLM (Qwen2.5-0.5B bf16) ─► Speech tokens (FSQ 6561)
    Ref transcript ┘                                        │
                                                            ▼
                              ┌─► prompt_token ─┐
    Reference WAV ─► S3-Tokenizer-v3            ├─► DiT Flow Matching ─► Mel
                  ─► Matcha mel ─► prompt_feat ─┘ (cond + spk_emb, bf16)  │
                  ─► CAM++ ─► flow_embedding                              ▼
                                                                      HiFi-GAN
                                                                          │
                                                                          ▼
                                                                   Audio (24 kHz)

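
The 6561 in the diagram is the speech-token vocabulary size, and 6561 = 3^8. That is the codebook a finite scalar quantiser (FSQ) produces with 8 dimensions of 3 levels each. The sketch below shows how such a quantiser could map 8 bounded values to one token id. The dimension count, the level count and the digit order are assumptions for illustration; this card does not document the tokenizer's internal layout.

```python
LEVELS = 3  # assumed values per dimension: -1, 0, +1
DIMS = 8    # assumed number of quantised dimensions; 3 ** 8 == 6561


def fsq_index(values):
    """Map DIMS floats (clipped to [-1, 1]) to a single token id in [0, 6561)."""
    if len(values) != DIMS:
        raise ValueError(f"expected {DIMS} values, got {len(values)}")
    index = 0
    for j, x in enumerate(values):
        clipped = max(-1.0, min(1.0, x))
        digit = int(round(clipped)) + 1  # -1/0/+1 becomes 0/1/2
        index += digit * LEVELS ** j     # base-3 positional encoding
    return index


print(LEVELS ** DIMS, fsq_index([1.0] * DIMS))
```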
Languages
Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian.
Files
- `llm.safetensors`: LLM weights (bf16, unquantised)
- `flow.safetensors`: DiT flow matching decoder (bf16, unquantised)
- `hifigan.safetensors`: HiFi-GAN vocoder (fp32, weight-norm folded)
- `speech_tokenizer.safetensors`: S3-Tokenizer-v3 reference encoder (bf16)
- `config.json`: Model configuration (tokenizer + frame rates)
- `vocab.json` / `merges.txt` / `tokenizer_config.json`: Qwen2.5 BPE tokenizer
Conversion Details
- LLM: bf16 throughout (no group quantisation applied)
- Flow / DiT: bf16 throughout (no group quantisation applied)
- HiFi-GAN: fp32 with weight normalization folded (`w = g * v / ||v||`)
- Speech tokenizer: bf16 (runs once per voice profile, accuracy outweighs disk size)
- Conv1d weights transposed from PyTorch `[out, in, kernel]` to MLX `[out, kernel, in]`
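
The last two conversion steps can be sketched with NumPy. This is not the actual conversion script: it assumes PyTorch's default weight-norm layout for Conv1d (one norm per output channel, `g` shaped `[out, 1, 1]`), and the function names are made up for the example.

```python
import numpy as np


def fold_weight_norm(g, v):
    """Fold weight normalisation into a plain weight: w = g * v / ||v||."""
    # v: [out, in, kernel]; g: [out, 1, 1]. One norm per output channel (assumed layout).
    norm = np.sqrt(np.sum(v.astype(np.float32) ** 2, axis=(1, 2), keepdims=True))
    return g * v / norm


def conv1d_torch_to_mlx(w):
    """Reorder a Conv1d weight from [out, in, kernel] to [out, kernel, in]."""
    return np.transpose(w, (0, 2, 1))


rng = np.random.default_rng(0)
v = rng.standard_normal((2, 4, 3)).astype(np.float32)  # out=2, in=4, kernel=3
g = np.array([1.5, 0.5], dtype=np.float32).reshape(2, 1, 1)

w = fold_weight_norm(g, v)      # still [out, in, kernel]
w_mlx = conv1d_torch_to_mlx(w)  # now [out, kernel, in]
print(w.shape, w_mlx.shape)
```

After folding, each output channel of `w` has an L2 norm equal to the magnitude of its `g` entry, so the separate `g` and `v` tensors no longer need to be stored.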
Zero-Shot Voice Cloning
For best clone quality, the LLM needs both the reference's acoustic prefix and its text transcript. Upstream's `inference_zero_shot` feeds the LLM `concat(prompt_text, content_text)` plus the reference's FSQ codes as an autoregressive prefix; this bundle ships everything needed for that path.
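
    import Foundation
    import CosyVoiceTTS

    // Placeholder path: point this at your own reference recording.
    let refURL = URL(fileURLWithPath: "reference.wav")

    let model = try await CosyVoiceTTSModel.fromPretrained(
        modelId: "aufklarer/CosyVoice3-0.5B-MLX-bf16"
    )
    let result = try await model.synthesize(
        text: "你好,欢迎来到 CosyVoice 三。",  // "Hello, welcome to CosyVoice 3."
        referenceWAV: refURL,
        // Transcript of the words spoken in the reference recording.
        referenceTranscript: "床前明月光,疑是地上霜。"
    )
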
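
The prefix described above can be pictured as a single token sequence. The Python sketch below only illustrates the ordering: the `<sos>` and `<task>` markers and their positions are assumptions (hypothetical names, not taken from this bundle or confirmed against upstream), and the integer ids are dummy values.

```python
SOS, TASK = "<sos>", "<task>"  # hypothetical marker names, for illustration only


def build_llm_prefix(prompt_text_ids, content_text_ids, prompt_speech_ids):
    """Assemble an illustrative autoregressive prefix for zero-shot cloning."""
    # concat(prompt_text, content_text): reference transcript first, then the new text.
    text_ids = list(prompt_text_ids) + list(content_text_ids)
    # The reference's FSQ codes come last, so generation continues from them.
    return [SOS] + text_ids + [TASK] + list(prompt_speech_ids)


prefix = build_llm_prefix([11, 12], [21, 22, 23], [6001, 6002])
print(prefix)
```

Because the reference speech codes end the prefix, the tokens the LLM generates next read as a continuation of the reference speaker, now saying the new text.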
Source
- Upstream: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
- Paper: CosyVoice 3 (arXiv:2505.17589)
Links
- speech-swift — Apple SDK
- soniqo.audio — website
- blog — blog
License
Apache 2.0 (inherited from upstream).