CosyVoice3-0.5B MLX 4-bit
CosyVoice 3 text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference. Includes the S3-Tokenizer-v3 reference-audio encoder needed for zero-shot voice cloning.
Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.
Swift inference: soniqo/speech-swift
Variants
| Variant | LLM | DiT | Total | Use case |
|---|---|---|---|---|
| This bundle (4-bit) | int4 (group_size=64) | int4 | ~1.1 GB | Smaller download / disk footprint |
| 8-bit | int8 (group_size=64) | int4 | ~1.4 GB | Perceptually cleaner audio, less text drift on long form |
Both bundles include the speech tokenizer and support zero-shot voice cloning.
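The `int4 (group_size=64)` entries refer to group-wise quantization: weights are split into groups of 64, and each group keeps its own scale and offset next to the 4-bit codes. The block below is a minimal pure-Python sketch of that idea, not the converter's code. The helper names `quantize_group` and `dequantize_group` are hypothetical, and the min/max affine scheme with 16-bit scale and offset is an assumption modelled on how MLX's affine quantization is usually described.

```python
import random

GROUP_SIZE = 64   # from the variants table
BITS = 4          # int4 -> 16 levels per weight


def quantize_group(weights, bits=BITS):
    """Affine-quantize one group; returns (codes, scale, offset)."""
    levels = (1 << bits) - 1              # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float weights."""
    return [c * scale + offset for c in codes]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]  # toy weights
codes, scale, offset = quantize_group(row)
restored = dequantize_group(codes, scale, offset)
max_error = max(abs(a - b) for a, b in zip(row, restored))

# Storage cost per weight if scale and offset are 16-bit values (assumption)
bits_per_weight = BITS + (16 + 16) / GROUP_SIZE

print(f"max abs error: {max_error:.6f} (half a step is {scale / 2:.6f})")
print(f"effective bits per weight: {bits_per_weight}")
```

Under these assumptions the per-group metadata adds half a bit per weight, so a nominally 4-bit tensor costs about 4.5 bits per weight on disk.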
Model Details
| Component | Architecture | Size |
|---|---|---|
| LLM | Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads) | 388 MB (4-bit) |
| DiT Flow Matching | 22-layer DiT (1024d, 16 heads, 10 ODE steps) | 186 MB (4-bit) |
| HiFi-GAN Vocoder | NSF + F0 predictor + ISTFT | 79 MB (fp32) |
| S3-Tokenizer-v3 | 12-layer Conformer + FSMN + FSQ (242M params) | 462 MB (bf16) |
| Total | | ~1.1 GB |
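As a quick check of the table, the component sizes can be summed and ranked. This sketch only uses the numbers listed above; treating 1 GB as 1000 MB is an assumption about the units used in the table.

```python
# Component sizes in MB, copied from the Model Details table
component_mb = {
    "LLM (4-bit)": 388,
    "DiT flow matching (4-bit)": 186,
    "HiFi-GAN vocoder (fp32)": 79,
    "S3-Tokenizer-v3 (bf16)": 462,
}

total_mb = sum(component_mb.values())
total_gb = total_mb / 1000  # decimal GB (assumption about the table's units)

# Share of the bundle taken by each component, largest first
shares = sorted(
    ((name, mb / total_mb) for name, mb in component_mb.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(f"total: {total_mb} MB (~{total_gb:.1f} GB)")
for name, share in shares:
    print(f"{name:28s} {share:5.1%}")
```

The bf16 speech tokenizer comes out as the largest single component, roughly 41% of the bundle, which is worth knowing if only text-to-speech without voice cloning is needed.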
Pipeline
Text          ─┐
               ├─► LLM (Qwen2.5-0.5B int4) ─► Speech tokens (FSQ 6561)
Ref transcript ┘                                         │
                                                         ▼
                          ┌─► prompt_token ─┐
Reference WAV ─► S3-Tokenizer-v3            ├─► DiT Flow Matching ─► Mel
              ─► Matcha mel ─► prompt_feat ─┘   (cond + spk_emb)      │
              ─► CAM++ ─► flow_embedding                              ▼
                                                                  HiFi-GAN
                                                                      │
                                                                      ▼
                                                               Audio (24 kHz)
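The `FSQ 6561` label in the diagram is consistent with a finite scalar quantizer that has 8 dimensions with 3 levels each, since 3^8 = 6561. As a hedged illustration (the dimension and level counts are assumptions inferred from that number, not stated in this card, and the digit order is arbitrary), a vector of 8 ternary digits can be packed into a single token id like this:

```python
LEVELS = 3      # assumed quantization levels per dimension
DIMS = 8        # assumed number of FSQ dimensions
CODEBOOK_SIZE = LEVELS ** DIMS


def fsq_index(digits):
    """Pack base-3 digits (each in 0..2) into one token id, lowest digit first."""
    if len(digits) != DIMS or any(d not in range(LEVELS) for d in digits):
        raise ValueError("expected 8 digits in the range 0..2")
    return sum(d * LEVELS ** i for i, d in enumerate(digits))


def fsq_digits(index):
    """Inverse of fsq_index: unpack a token id into its 8 base-3 digits."""
    if not 0 <= index < CODEBOOK_SIZE:
        raise ValueError("token id out of range")
    digits = []
    for _ in range(DIMS):
        index, d = divmod(index, LEVELS)
        digits.append(d)
    return digits


print(CODEBOOK_SIZE)
print(fsq_index([2] * DIMS))
```

Every speech token the LLM emits would then be one id in the range 0 to 6560.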
Languages
Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian.
Files
- `llm.safetensors`: LLM weights (4-bit group-quantised)
- `flow.safetensors`: DiT flow matching decoder (4-bit DiT, fp32 input/output projections)
- `hifigan.safetensors`: HiFi-GAN vocoder (fp32, weight-norm folded)
- `speech_tokenizer.safetensors`: S3-Tokenizer-v3 reference encoder (bf16)
- `config.json`: Model configuration (quantisation bits, tokenizer + frame rates)
- `vocab.json` / `merges.txt` / `tokenizer_config.json`: Qwen2.5 BPE tokenizer
Conversion Details
- LLM: 4-bit group quantization (group_size=64) of attention projections, MLP, and speech head
- Flow / DiT: 4-bit group quantization of attention + FFN linears, fp32 input/output projections
- HiFi-GAN: fp32 with weight normalization folded (`w = g * v / ||v||`)
- Speech tokenizer: bf16 (runs once per voice profile, accuracy outweighs disk size)
- Conv1d weights transposed from PyTorch `[out, in, kernel]` to MLX `[out, kernel, in]`
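The last two conversion steps above are simple tensor rewrites. The sketch below shows both on nested Python lists rather than real tensors; the function names are hypothetical, and computing the norm per output channel is an assumption based on PyTorch's default `weight_norm` behaviour, not something this card states.

```python
import math


def fold_weight_norm(g, v):
    """Fold weight norm for one output channel: w = g * v / ||v||.

    v is indexed as v[in][kernel]; the norm runs over all of its entries.
    """
    norm = math.sqrt(sum(x * x for row in v for x in row))
    return [[g * x / norm for x in row] for row in v]


def torch_to_mlx_conv1d(weight):
    """Reorder [out, in, kernel] nested lists to [out, kernel, in]."""
    return [[list(col) for col in zip(*out_channel)] for out_channel in weight]


# One output channel with 2 input channels and kernel size 3
v = [[3.0, 0.0, 0.0],
     [0.0, 4.0, 0.0]]
w = fold_weight_norm(g=10.0, v=v)                # ||v|| = 5, so w = 2 * v

torch_weight = [w]                               # shape [out=1, in=2, kernel=3]
mlx_weight = torch_to_mlx_conv1d(torch_weight)   # shape [out=1, kernel=3, in=2]

print(w)
print(mlx_weight)
```

Folding removes the separate `g` and `v` tensors, so the vocoder stores one plain weight per convolution and needs no normalization at inference time.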
Zero-Shot Voice Cloning
For best clone quality, the LLM needs both the reference clip's acoustic prefix and its text transcript. Upstream's `inference_zero_shot` feeds the LLM `concat(prompt_text, content_text)` together with the reference's FSQ codes as an autoregressive prefix; this bundle ships everything needed for that path.
import CosyVoiceTTS

let model = try await CosyVoiceTTSModel.fromPretrained()  // defaults to this 4-bit bundle

let refAudio = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "ref.wav"), targetSampleRate: 16_000)

let cacheDir = try HuggingFaceDownloader.getCacheDirectory(
    for: "aufklarer/CosyVoice3-0.5B-MLX-4bit")
let tokenizer = try SpeechTokenizerModel.fromSafetensors(
    at: cacheDir.appendingPathComponent("speech_tokenizer.safetensors"))

let profile = try model.extractVoiceProfile(
    audio: refAudio, sampleRate: 16_000,
    speechTokenizer: tokenizer,
    referenceTranscript: "Transcript of the reference clip."
)

let audio = model.synthesize(
    text: "Welcome to the demo.",
    voiceProfile: profile,
    language: "english"
)
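To make the prompt layout described at the top of this section concrete, here is a small language-neutral sketch in Python. It is not the Swift library's or upstream's implementation: `build_llm_prefix` is a hypothetical helper, all token ids are made up, and any special markers the real model places between the text and speech segments are left out.

```python
FSQ_CODEBOOK_SIZE = 6561  # from the pipeline diagram


def build_llm_prefix(prompt_text_ids, content_text_ids, ref_speech_codes):
    """Return (text_ids, speech_prefix) for zero-shot synthesis.

    text_ids is concat(prompt_text, content_text); speech_prefix holds the
    reference clip's FSQ codes, which the LLM continues autoregressively.
    """
    if any(not 0 <= c < FSQ_CODEBOOK_SIZE for c in ref_speech_codes):
        raise ValueError("speech code outside the FSQ codebook")
    text_ids = list(prompt_text_ids) + list(content_text_ids)
    speech_prefix = list(ref_speech_codes)
    return text_ids, speech_prefix


prompt_text_ids = [101, 102, 103]   # hypothetical ids for the reference transcript
content_text_ids = [201, 202]       # hypothetical ids for the text to speak
ref_speech_codes = [6000, 12, 345]  # hypothetical FSQ codes from the reference WAV

text_ids, speech_prefix = build_llm_prefix(
    prompt_text_ids, content_text_ids, ref_speech_codes
)
print(text_ids)
print(speech_prefix)
```

Seen this way, the reference transcript matters because the speech prefix is the spoken form of `prompt_text`: without the transcript, the text side and the acoustic side of the prompt no longer line up.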
CLI
audio speak "Welcome to the demo." \
--engine cosyvoice \
--voice-sample ref.wav \
--cosy-reference-transcript "Transcript of ref.wav..." \
--output out.wav
License
Apache 2.0 (same as upstream CosyVoice 3).
Citation
@article{du2025cosyvoice3,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}
The S3-Tokenizer-v3 PyTorch reimplementation used at conversion time is xingchensong/S3Tokenizer.
- Guide: soniqo.audio/guides/cosyvoice
- GitHub: soniqo/speech-swift