---
language:
- zh
- en
- ja
- ko
- de
- es
- fr
- it
- ru
license: apache-2.0
tags:
- tts
- text-to-speech
- speech-synthesis
- mlx
- apple-silicon
- cosyvoice
base_model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
pipeline_tag: text-to-speech
---
CosyVoice3-0.5B MLX 4-bit
CosyVoice 3 text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference.
Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.
Swift inference: ivan-digital/qwen3-asr-swift
Model Details
| Component | Architecture | Size |
|---|---|---|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 467 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 634 MB (fp16) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp16) |
| Total | | ~1.2 GB |
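
The 14Q/2KV head layout in the LLM row is grouped-query attention, which mainly pays off in the key/value cache during autoregressive decoding. The arithmetic below is a rough sketch only: it assumes a head dimension of 896 / 14 = 64 and an fp16 cache (2 bytes per value), neither of which is stated in this card, and it compares against a hypothetical layout with 14 KV heads.

```python
# Figures taken from the LLM row of the Model Details table.
NUM_LAYERS = 24
HIDDEN_DIM = 896
NUM_Q_HEADS = 14
NUM_KV_HEADS = 2

# Assumption: head_dim = hidden_dim / number of query heads.
HEAD_DIM = HIDDEN_DIM // NUM_Q_HEADS


def kv_cache_bytes_per_token(num_kv_heads: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one token across all layers.

    bytes_per_value=2 assumes an fp16 cache (an assumption, not from the card).
    """
    keys_and_values = 2
    return keys_and_values * NUM_LAYERS * num_kv_heads * HEAD_DIM * bytes_per_value


gqa = kv_cache_bytes_per_token(NUM_KV_HEADS)  # layout described in the table
mha = kv_cache_bytes_per_token(NUM_Q_HEADS)   # hypothetical: one KV head per query head
print(f"GQA: {gqa} B/token, full multi-head baseline: {mha} B/token")
```

Under these assumptions the cache grows by 12 KiB per generated token, one seventh of what the hypothetical 14-KV-head layout would need.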
Pipeline
Text → LLM (Qwen2.5-0.5B) → Speech Tokens (FSQ 6561) → DiT Flow Matching → Mel (80-band) → HiFi-GAN → Audio (24 kHz)
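
The speech-token vocabulary size of 6561 equals 3^8, which is what finite scalar quantization (FSQ) yields with 8 dimensions of 3 levels each. That factorization is an assumption here, not something this card states. Under it, a token id is just a mixed-radix encoding of the per-dimension levels, as in this minimal sketch (`fsq_index` and `fsq_digits` are illustrative names, not part of any library):

```python
LEVELS = 3  # assumed quantization levels per dimension
DIMS = 8    # assumed number of quantized dimensions (3 ** 8 == 6561)


def fsq_index(digits):
    """Map per-dimension levels (each in 0..LEVELS-1) to a single token id."""
    if len(digits) != DIMS or any(not 0 <= d < LEVELS for d in digits):
        raise ValueError(f"expected {DIMS} digits in range 0..{LEVELS - 1}")
    index = 0
    for position, digit in enumerate(digits):
        index += digit * LEVELS ** position  # position 0 is least significant
    return index


def fsq_digits(index):
    """Inverse of fsq_index: recover the per-dimension levels from a token id."""
    if not 0 <= index < LEVELS ** DIMS:
        raise ValueError("token id out of range")
    digits = []
    for _ in range(DIMS):
        digits.append(index % LEVELS)
        index //= LEVELS
    return digits


print(LEVELS ** DIMS)         # codebook size under the assumed factorization
print(fsq_index([2] * DIMS))  # largest token id
```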
Languages
Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
Files
- llm.safetensors: LLM weights (4-bit quantized)
- flow.safetensors: DiT flow matching decoder (fp16)
- hifigan.safetensors: HiFi-GAN vocoder (fp16, weight-norm folded)
- config.json: Model configuration
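
To check what a given weight file contains without loading it into MLX, the safetensors header can be read directly: the file starts with an 8-byte little-endian length, followed by that many bytes of JSON describing each tensor. The sketch below uses only the standard library; `read_safetensors_header` and `summarize` are illustrative helper names, not part of any package.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the tensor table from a .safetensors file without reading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata entry; it is not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header):
    """Sorted list of (tensor name, dtype, shape) tuples."""
    return sorted(
        (name, info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
    )
```

After downloading the repository, calling `summarize(read_safetensors_header("llm.safetensors"))` should list the tensors stored in that file together with their dtypes and shapes. The exact tensor names depend on the conversion and are not documented here.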
Conversion Details
- LLM: 4-bit quantization (group_size=64) of attention projections, MLP, and speech head
- Flow: fp16 (flow matching is sensitive to quantization)
- HiFi-GAN: fp16 with weight normalization folded (w = g * v / ||v||)
- Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
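
The conversion steps listed above can be illustrated in plain Python on nested lists. This is a sketch of the general techniques, not the script used to produce these files. It assumes the weight norm is taken per output channel (PyTorch's default for `weight_norm`) and uses a generic affine 4-bit scheme with one scale and one bias per group. MLX's own quantizer may choose scales differently. All function names are illustrative.

```python
import math


def fold_weight_norm(g, v):
    """Fold weight norm into plain weights: w = g * v / ||v||.

    v has layout [out][in][kernel]; g holds one gain per output channel.
    Assumes the norm is computed per output channel.
    """
    folded = []
    for gain, out_ch in zip(g, v):
        norm = math.sqrt(sum(x * x for row in out_ch for x in row))
        folded.append([[gain * x / norm for x in row] for row in out_ch])
    return folded


def transpose_conv1d(weight):
    """Reorder a Conv1d weight from [out][in][kernel] to [out][kernel][in]."""
    return [[list(col) for col in zip(*out_ch)] for out_ch in weight]


def quantize_group(values, bits=4):
    """Affine-quantize one group so that value ~= scale * code + bias."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # constant group: any non-zero scale works
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate values from integer codes."""
    return [scale * code + bias for code in codes]


# One output channel, two input channels, kernel size 3; ||v|| is 5.
v = [[[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]]
w = fold_weight_norm([10.0], v)
w_mlx = transpose_conv1d(w)  # now [out][kernel][in]

# One group of 64 weights, matching group_size=64.
group = [i / 63 for i in range(64)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

In this scheme the rounding error per weight is bounded by half the group's scale, so a group with a wide value range reconstructs less precisely than a narrow one.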
Usage
For use with ivan-digital/qwen3-asr-swift:
import CosyVoiceTTS
let model = try await CosyVoiceTTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, how are you?", language: "english")
CLI
swift run cosyvoice-tts-cli --text "Hello, how are you?" --lang english --output hello.wav
License
Apache 2.0 (same as upstream CosyVoice 3)
Citation
@article{du2025cosyvoice3,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}