CosyVoice3-0.5B MLX 4-bit

CosyVoice 3 text-to-speech model converted to MLX safetensors format with 4-bit quantization for Apple Silicon inference. Includes the S3-Tokenizer-v3 reference-audio encoder needed for zero-shot voice cloning.

Converted from FunAudioLLM/Fun-CosyVoice3-0.5B-2512.

Swift inference: soniqo/speech-swift

Variants

Variant	LLM	DiT	Total	Use case
This bundle (4-bit)	int4 (group_size=64)	int4	~1.1 GB	Smaller download / disk footprint
8-bit	int8 (group_size=64)	int4	~1.4 GB	Perceptually cleaner audio, less text drift on long-form input

Both bundles include the speech tokenizer and support zero-shot voice cloning.
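
The int4 and int8 labels in the table refer to group-wise quantization with 64 weights per group. As a rough illustration of why the 8-bit LLM reconstructs weights more closely, here is a minimal sketch of affine group quantization. It assumes each group stores a scale and a bias (minimum) and decodes as scale * code + bias; the packed storage layout and exact rounding used by MLX are not reproduced. The names quantize, dequantize, and maxError are illustrative and not part of speech-swift.

```swift
import Foundation

/// One quantized group: integer codes plus the scale and bias needed to decode them.
struct QuantizedGroup {
    let codes: [UInt8]
    let scale: Float
    let bias: Float
}

/// Quantizes `weights` in groups of `groupSize` using `bits` bits per code.
/// Each group stores its minimum as `bias` and (max - min) / (2^bits - 1) as `scale`.
func quantize(_ weights: [Float], bits: Int, groupSize: Int = 64) -> [QuantizedGroup] {
    precondition(bits >= 1 && bits <= 8, "codes are stored in UInt8")
    let maxCode = Float((1 << bits) - 1)
    var groups: [QuantizedGroup] = []
    for start in stride(from: 0, to: weights.count, by: groupSize) {
        let group = Array(weights[start..<min(start + groupSize, weights.count)])
        let lo = group.min()!
        let hi = group.max()!
        // Guard against a constant group, where hi == lo would give a zero scale.
        let scale = hi > lo ? (hi - lo) / maxCode : 1
        let codes = group.map { w -> UInt8 in
            let level = ((w - lo) / scale).rounded()
            return UInt8(min(max(level, 0), maxCode))
        }
        groups.append(QuantizedGroup(codes: codes, scale: scale, bias: lo))
    }
    return groups
}

/// Reconstructs approximate weights: w ≈ scale * code + bias.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { g in g.codes.map { Float($0) * g.scale + g.bias } }
}

/// Largest absolute reconstruction error over all weights.
func maxError(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map { abs($0 - $1) }.max() ?? 0
}

// Toy weights: 128 values (two groups of 64) from a smooth deterministic function.
let weights = (0..<128).map { Float(sin(Double($0) * 0.37)) }
let err4 = maxError(weights, dequantize(quantize(weights, bits: 4)))
let err8 = maxError(weights, dequantize(quantize(weights, bits: 8)))
print("max error at 4 bits: \(err4), at 8 bits: \(err8)")
```

With 16 levels per group instead of 256, the 4-bit grid is about 17 times coarser, which is the trade-off the two variants expose.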

Model Details

Component	Architecture	Size
LLM	Qwen2.5-0.5B (24L, 896d, 14Q/2KV heads)	388 MB (4-bit)
DiT Flow Matching	22-layer DiT (1024d, 16 heads, 10 ODE steps)	186 MB (4-bit)
HiFi-GAN Vocoder	NSF + F0 predictor + ISTFT	79 MB (fp32)
S3-Tokenizer-v3	12-layer Conformer + FSMN + FSQ (242M params)	462 MB (bf16)
Total		~1.1 GB
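
The "10 ODE steps" entry means the flow-matching decoder turns noise into a mel spectrogram by integrating a learned velocity field in a fixed number of steps. The sketch below shows the general idea with plain Euler integration over uniform time steps. It is an assumption-laden illustration: the solver, time schedule, and guidance actually used by the DiT are not described here, and the constant velocity closure is a toy stand-in for the network.

```swift
/// Integrates dx/dt = velocity(x, t) from t = 0 to t = 1 with `steps` uniform Euler steps.
func eulerIntegrate(
    from x0: [Float],
    steps: Int = 10,
    velocity: ([Float], Float) -> [Float]
) -> [Float] {
    var x = x0
    let dt = 1 / Float(steps)
    for step in 0..<steps {
        let t = Float(step) * dt
        let v = velocity(x, t)
        // Euler update: x <- x + dt * v(x, t)
        x = zip(x, v).map { $0 + dt * $1 }
    }
    return x
}

// Toy stand-in for the DiT: a constant velocity pointing from the noise sample to a target.
let noise: [Float] = [0.5, -1.0, 2.0]
let target: [Float] = [1.0, 0.0, -1.0]
let mel = eulerIntegrate(from: noise) { _, _ in
    zip(target, noise).map { $0 - $1 }
}
print(mel)
```

Each step costs one forward pass of the decoder, so the step count trades synthesis time against integration accuracy.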

Pipeline

Text          ─┐
                ├─► LLM (Qwen2.5-0.5B int4)  ─► Speech tokens (FSQ 6561)
Ref transcript ┘                                           │
                                                           ▼
                              ┌─► prompt_token ─┐
Reference WAV ─► S3-Tokenizer-v3                ├─► DiT Flow Matching ─► Mel
              ─► Matcha mel    ─► prompt_feat ─┘    (cond + spk_emb)         │
              ─► CAM++         ─► flow_embedding                              ▼
                                                                          HiFi-GAN
                                                                              │
                                                                              ▼
                                                                         Audio (24 kHz)
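
The "FSQ 6561" label is the size of the speech-token vocabulary. The number equals 3^8, which is consistent with finite scalar quantization over 8 dimensions at 3 levels each. The sketch below shows how per-dimension levels could map to a single token ID under that assumption. The dimension count, level count, and digit order are assumptions for illustration, and fsqIndex and fsqDigits are hypothetical names.

```swift
/// Packs per-dimension FSQ levels (each in 0..<levels) into a single token ID,
/// treating the levels as digits of a base-`levels` number (first digit least significant).
func fsqIndex(_ digits: [Int], levels: Int = 3) -> Int {
    var index = 0
    var placeValue = 1
    for digit in digits {
        precondition((0..<levels).contains(digit), "level out of range")
        index += digit * placeValue
        placeValue *= levels
    }
    return index
}

/// Inverse of `fsqIndex`: recovers the per-dimension levels from a token ID.
func fsqDigits(_ index: Int, dimensions: Int = 8, levels: Int = 3) -> [Int] {
    var remaining = index
    var digits: [Int] = []
    for _ in 0..<dimensions {
        digits.append(remaining % levels)
        remaining /= levels
    }
    return digits
}

// The largest ID uses the top level in every dimension, so the codebook size is that ID plus one.
let codebookSize = fsqIndex(Array(repeating: 2, count: 8)) + 1
print(codebookSize)  // 6561
```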

Languages

Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian.

Files

llm.safetensors — LLM weights (4-bit group-quantized)
flow.safetensors — DiT flow matching decoder (4-bit DiT, fp32 input/output projections)
hifigan.safetensors — HiFi-GAN vocoder (fp32, weight-norm folded)
speech_tokenizer.safetensors — S3-Tokenizer-v3 reference encoder (bf16)
config.json — Model configuration (quantization bits, tokenizer and frame rates)
vocab.json / merges.txt / tokenizer_config.json — Qwen2.5 BPE tokenizer
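
When working from a manually downloaded copy, it can help to confirm that every file listed above is present before loading. The following is a minimal sketch using only Foundation; missingFiles is a hypothetical helper and the directory path is a placeholder, neither is part of speech-swift.

```swift
import Foundation

/// File names taken from the list above.
let expectedFiles = [
    "llm.safetensors", "flow.safetensors", "hifigan.safetensors",
    "speech_tokenizer.safetensors", "config.json",
    "vocab.json", "merges.txt", "tokenizer_config.json",
]

/// Returns the expected files that are missing from `directory`.
func missingFiles(in directory: URL) -> [String] {
    expectedFiles.filter { name in
        !FileManager.default.fileExists(atPath: directory.appendingPathComponent(name).path)
    }
}

// Placeholder location; substitute the directory that holds the downloaded bundle.
let bundleDir = URL(fileURLWithPath: "path/to/bundle")
let missing = missingFiles(in: bundleDir)
print(missing.isEmpty ? "bundle complete" : "missing: \(missing)")
```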

Conversion Details

LLM: 4-bit group quantization (group_size=64) of attention projections, MLP, and speech head
Flow / DiT: 4-bit group quantization of attention + FFN linears, fp32 input/output projections
HiFi-GAN: fp32 with weight normalization folded (w = g * v / ||v||)
Speech tokenizer: bf16 (it runs once per voice profile, so accuracy outweighs disk size)
Conv1d weights transposed from PyTorch [out, in, kernel] to MLX [out, kernel, in]
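
Two of the steps above are simple tensor transforms and can be sketched on nested arrays. This is an illustration only, not the conversion script: it assumes the weight-norm gain g holds one value per output channel and that the norm is taken over each output channel's [in, kernel] slice, which is the usual default for PyTorch weight normalization. The names Tensor3, foldWeightNorm, and transposeConv1d are hypothetical.

```swift
typealias Tensor3 = [[[Float]]]

/// Folds weight normalization into a plain weight: w[o] = g[o] * v[o] / ||v[o]||,
/// with the norm taken over each output channel's [in, kernel] slice.
func foldWeightNorm(g: [Float], v: Tensor3) -> Tensor3 {
    precondition(g.count == v.count, "one gain per output channel")
    var folded: Tensor3 = []
    for (gain, channel) in zip(g, v) {
        let sumOfSquares = channel.joined().reduce(Float(0)) { $0 + $1 * $1 }
        let factor = gain / sumOfSquares.squareRoot()
        folded.append(channel.map { row in row.map { $0 * factor } })
    }
    return folded
}

/// Reorders a Conv1d weight from [out, in, kernel] to [out, kernel, in].
func transposeConv1d(_ w: Tensor3) -> Tensor3 {
    w.map { channel -> [[Float]] in
        let inCount = channel.count
        let kernelSize = channel.first?.count ?? 0
        return (0..<kernelSize).map { k in
            (0..<inCount).map { i in channel[i][k] }
        }
    }
}

// One output channel, two input channels, kernel size 3; ||v|| = 5, so the factor is 10 / 5 = 2.
let v: Tensor3 = [[[3, 0, 0], [0, 4, 0]]]
let w = foldWeightNorm(g: [10], v: v)
let wMLX = transposeConv1d(w)
print(w)
print(wMLX)
```

Folding at conversion time means inference needs only the single tensor w rather than recomputing it from g and v on every load.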

Zero-Shot Voice Cloning

For the best clone quality, the LLM needs both the reference's acoustic prefix and its text transcript. Upstream's inference_zero_shot feeds the LLM concat(prompt_text, content_text), plus the reference's FSQ codes as an autoregressive prefix. This bundle ships everything needed for that path.

import CosyVoiceTTS

// Load the model.
let model = try await CosyVoiceTTSModel.fromPretrained()  // defaults to this 4-bit bundle

// Load the reference clip at 16 kHz, the same rate passed to extractVoiceProfile below.
let refAudio = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "ref.wav"), targetSampleRate: 16_000)

// Locate the cached bundle and load the S3-Tokenizer-v3 weights from it.
let cacheDir = try HuggingFaceDownloader.getCacheDirectory(
    for: "aufklarer/CosyVoice3-0.5B-MLX-4bit")
let tokenizer = try SpeechTokenizerModel.fromSafetensors(
    at: cacheDir.appendingPathComponent("speech_tokenizer.safetensors"))

// Build a voice profile from the reference audio and its transcript.
let profile = try model.extractVoiceProfile(
    audio: refAudio, sampleRate: 16_000,
    speechTokenizer: tokenizer,
    referenceTranscript: "Transcript of the reference clip."
)

// Synthesize the target text in the cloned voice.
let audio = model.synthesize(
    text: "Welcome to the demo.",
    voiceProfile: profile,
    language: "english"
)
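
The prompt layout described at the top of this section can be pictured as a single token sequence. The sketch below only shows the ordering: reference transcript, then target text, then the reference clip's speech tokens as the prefix the LLM continues from. It is a simplified assumption, not the speech-swift implementation; any special tokens that upstream inserts are omitted, the token IDs are made up, and PromptToken and buildZeroShotPrompt are hypothetical names.

```swift
/// Hypothetical view of one LLM input position: a text token or a speech (FSQ) token.
enum PromptToken: Equatable {
    case text(Int)
    case speech(Int)
}

/// Lays out the zero-shot prompt: reference transcript tokens, then target text tokens,
/// then the reference clip's speech tokens as the prefix the LLM continues from.
func buildZeroShotPrompt(
    promptText: [Int],
    contentText: [Int],
    promptSpeech: [Int]
) -> [PromptToken] {
    let textPart = (promptText + contentText).map { PromptToken.text($0) }
    let speechPrefix = promptSpeech.map { PromptToken.speech($0) }
    return textPart + speechPrefix
}

// Made-up token IDs, for illustration only.
let prompt = buildZeroShotPrompt(
    promptText: [11, 12],
    contentText: [21, 22, 23],
    promptSpeech: [6000, 42]
)
print(prompt.count)  // 7
```

Because the speech prefix corresponds to the transcript at the start of the text, the LLM continues generating speech tokens for the remaining target text in the same voice.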

CLI

audio speak "Welcome to the demo." \
  --engine cosyvoice \
  --voice-sample ref.wav \
  --cosy-reference-transcript "Transcript of ref.wav..." \
  --output out.wav

License

Apache 2.0 (same as upstream CosyVoice 3).

Citation

@article{du2025cosyvoice3,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}

The S3-Tokenizer-v3 PyTorch reimplementation used at conversion time is xingchensong/S3Tokenizer.