Fun-CosyVoice3-0.5B-2512 - GGUF
GGUF conversion of FunAudioLLM/Fun-CosyVoice3-0.5B-2512 for use with CrispASR (--backend cosyvoice3-tts).
CosyVoice3 is a streaming, multilingual, zero-shot voice-cloning TTS system from Alibaba's FunAudioLLM team. The 0.5B-2512 release is Apache-2.0 licensed and supports 9 languages plus 18 Chinese dialects. Output is 24 kHz mono.
The model is a three-stage pipeline:
text (Qwen2 BPE) → CosyVoice3LM (Qwen2-0.5B + speech-token AR head)
→ speech tokens ∈ [0, 6561)
→ Flow (DiT + CausalConditionalCFM, 10-step Euler ODE)
→ mel @ 24 kHz / 480-hop
→ CausalHiFTGenerator (HiFi-GAN + NSF + iSTFT)
→ 24 kHz PCM
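As a small worked example of the timing implied by the mel stage, the 24 kHz sample rate and 480-sample hop quoted in the pipeline give 50 mel frames per second. The helper names below are illustrative only and are not part of CrispASR or CosyVoice.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate stated in the pipeline
HOP_SAMPLES = 480        # mel hop size stated in the pipeline


def mel_frames_per_second(sample_rate_hz: int = SAMPLE_RATE_HZ,
                          hop: int = HOP_SAMPLES) -> float:
    """Number of mel frames produced per second of audio."""
    return sample_rate_hz / hop


def pcm_samples_for_frames(n_frames: int, hop: int = HOP_SAMPLES) -> int:
    """Approximate PCM length for n_frames mel frames (ignores edge padding)."""
    return n_frames * hop


if __name__ == "__main__":
    print(mel_frames_per_second())      # 50.0
    print(pcm_samples_for_frames(100))  # 48000 samples, i.e. 2 s at 24 kHz
```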
Files
| File | Quantisation | Size |
|---|---|---|
| cosyvoice3-llm-f16.gguf | F16 | 1.29 GB |
| cosyvoice3-llm-q4_k.gguf | Q4_K (Q4_0 fallback on 896-wide rows; head + embeddings stay F16) | 384 MB |
| cosyvoice3-flow-f16.gguf | F16 | 665 MB |
| cosyvoice3-flow-q8_0.gguf | Q8_0 (input_embd + spk_affine stay F16) | 361 MB |
| cosyvoice3-hift-f16.gguf | F16 (too small to benefit from quantisation) | 42 MB |
| cosyvoice3-voices.gguf | F32 voice-clone bank: 8 baked voices (zero_shot + en/de/zh/ja/fr/es/ko) | 665 KB |
| cosyvoice3-s3tok-f16.gguf | F16 speech_tokenizer_v3; byte-exact vs ONNX | 462 MB |
| cosyvoice3-s3tok-q4_k.gguf | Q4_K s3tok (FSQ proj stays F16); ~0.6% token drift; optional smaller variant | 139 MB |
| cosyvoice3-campplus-f16.gguf | F16 CAMPPlus 192-D speaker encoder | 13 MB |
Pick one LLM + one flow + HiFT + voices. The smallest viable
combo is llm-q4_k + flow-q8_0 + hift-f16 + voices: 745 MB for the
LLM and flow, or about 788 MB with HiFT and the voice bank included.
The F16 reference (llm-f16 + flow-f16) is 1.96 GB for the LLM and flow.
The s3tok + campplus companions are only needed for arbitrary-WAV
runtime cloning (below), not for synthesis with a baked voice.
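The download size of a given combination can be checked with a few lines of arithmetic. This sketch copies the sizes from the Files table and assumes 1 GB = 1000 MB, which the table does not state; `SIZES_MB` and `combo_size_mb` are illustrative names.

```python
# Sizes in MB, copied from the Files table.
# ASSUMPTION: GB values are converted at 1 GB = 1000 MB.
SIZES_MB = {
    "llm-f16": 1290.0,
    "llm-q4_k": 384.0,
    "flow-f16": 665.0,
    "flow-q8_0": 361.0,
    "hift-f16": 42.0,
    "voices": 0.665,
}


def combo_size_mb(llm: str, flow: str, include_hift_and_voices: bool = True) -> float:
    """Total size of one LLM + one flow, optionally with HiFT and the voice bank."""
    total = SIZES_MB[llm] + SIZES_MB[flow]
    if include_hift_and_voices:
        total += SIZES_MB["hift-f16"] + SIZES_MB["voices"]
    return total


if __name__ == "__main__":
    for llm in ("llm-f16", "llm-q4_k"):
        for flow in ("flow-f16", "flow-q8_0"):
            print(f"{llm} + {flow}: {combo_size_mb(llm, flow):.0f} MB in total")
```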
Quant validation (ASR roundtrip on smoke prompt)
Synthesis used the default zero-shot voice (upstream
asset/zero_shot_prompt.wav) at --temperature 0.8 --seed 42. The
generated WAV was transcribed with parakeet-tdt-0.6b-v3-q4_k and
compared against the prompt text.
| Combo | Synthesis size | ASR transcript of TTS output | WER |
|---|---|---|---|
| llm-f16 + flow-f16 | 1.96 GB | "Hello, this is a test." | 0% |
| llm-f16 + flow-q8_0 | 1.66 GB | "Hello, this is a test." | 0% |
| llm-q4_k + flow-f16 | 1.05 GB | "Hello? This is a test." | 0% (punct only) |
| llm-q4_k + flow-q8_0 | 745 MB | "Hello? This is a test." | 0% (punct only) |
| llm-q4_k + flow-q8_0 (German) | - | "Hallo? Das ist ein Test." | 0% (punct only) |
Q4_K LLM introduces a small punctuation drift (commas occasionally read as question-intonation) but content is fully preserved across languages. Q8_0 flow is perceptually indistinguishable from F16.
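To make the "0% (punct only)" entries concrete, here is a minimal sketch of a word error rate that ignores case and punctuation. The exact normalisation used for the table is not documented here, so treat the `normalise` rules as an assumption; `wer` is a plain word-level Levenshtein distance divided by the reference length.

```python
import string


def normalise(text: str) -> list[str]:
    """Lower-case, strip ASCII punctuation, split on whitespace (assumed rules)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    print(wer("Hello, this is a test.", "Hello? This is a test."))  # 0.0
```

With this normalisation the comma-versus-question-mark difference does not count as a word error, which matches how the table reports it.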
Usage
CrispASR (recommended)
# Auto-discovers flow + hift + voices as siblings of the LLM.
crispasr -m cosyvoice3-llm-q4_k.gguf \
--backend cosyvoice3-tts \
--tts "Hello, this is a test." \
--voice zero_shot \
--tts-output out.wav
The CLI auto-discovers companion GGUFs in this order:
- Flow: `cosyvoice3-flow-*.gguf` next to the LLM, or `--codec-model PATH`.
- CAMPPlus: `cosyvoice3-campplus-f16.gguf` next to the LLM, or `COSYVOICE3_CAMPPLUS_PATH` for the native WAV-clone path.
- S3Tokenizer: `cosyvoice3-s3tok-f16.gguf` next to the LLM, or `COSYVOICE3_S3TOK_PATH` for the native WAV-clone path.
- HiFT: `cosyvoice3-hift-*.gguf` next to the LLM, or `COSYVOICE3_HIFT_PATH` env var.
- Voices: `cosyvoice3-voices.gguf` next to the LLM, or `COSYVOICE3_VOICES_PATH` env var.
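The sibling-or-override lookup described above can be sketched roughly as follows. This is not CrispASR's actual code: `find_companion` is a hypothetical helper, and two behaviours are assumptions, namely that an explicit override takes precedence over sibling discovery and that the alphabetically first match wins when several siblings fit the pattern.

```python
import os
from pathlib import Path
from typing import Optional


def find_companion(llm_path: str, pattern: str,
                   override: Optional[str] = None) -> Optional[Path]:
    """Return the override if given, else the first sibling matching `pattern`."""
    # ASSUMPTION: an explicit path (env var or CLI flag) beats auto-discovery.
    if override:
        return Path(override)
    # ASSUMPTION: with several matches, pick the alphabetically first one.
    siblings = sorted(Path(llm_path).resolve().parent.glob(pattern))
    return siblings[0] if siblings else None


if __name__ == "__main__":
    llm = "cosyvoice3-llm-q4_k.gguf"
    hift = find_companion(llm, "cosyvoice3-hift-*.gguf",
                          os.environ.get("COSYVOICE3_HIFT_PATH"))
    print("HiFT:", hift)
```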
Greedy decode is disabled by default (CV3 falls into a documented
"silent_tokens" loop within ~5 steps). The backend overrides
--temperature 0 to 0.8 so the RAS sampler engages; pass a different
positive value to override.
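The override rule reads as a simple mapping from the requested temperature to the one actually used. The sketch below is only an illustration of that rule; the function name is made up, and treating negative values like 0 is an assumption.

```python
DEFAULT_TTS_TEMPERATURE = 0.8  # value the backend substitutes, per the text above


def effective_temperature(requested: float) -> float:
    """Map a greedy request (temperature 0) to the sampling default."""
    # ASSUMPTION: any non-positive value is treated as a greedy request.
    if requested <= 0.0:
        return DEFAULT_TTS_TEMPERATURE
    return requested


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0):
        print(t, "->", effective_temperature(t))
```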
Voices
cosyvoice3-voices.gguf ships a small multilingual voice bank β pass
the name to --voice:
| --voice | Language | Prompt source |
|---|---|---|
| zero_shot | Mandarin | upstream asset/zero_shot_prompt.wav (~3.5 s) |
| fleurs-en | English | FLEURS en (CC BY 4.0) |
| fleurs-de | German | FLEURS de (CC BY 4.0) |
| fleurs-zh | Mandarin | FLEURS zh (CC BY 4.0) |
| fleurs-ja | Japanese | FLEURS ja (CC BY 4.0) |
| fleurs-fr | French | FLEURS fr (CC BY 4.0) |
| fleurs-es | Spanish | FLEURS es (CC BY 4.0) |
| fleurs-ko | Korean | FLEURS ko (CC BY 4.0) |
The fleurs-* prompts are ~4-6 s clips from Google's FLEURS corpus
(CC BY 4.0), loudness-normalised before baking. CV3 clones the prompt's
timbre and level, so quiet prompts yield quiet output; normalise your
own prompt clips for a consistent level. More voices can be baked with
the converter in the CrispASR tree:
python models/convert-cosyvoice3-voices-to-gguf.py \
--manifest my-voices.json \
--upstream-base /path/to/CosyVoice-clone \
--output my-voices.gguf
Each manifest entry is {name, wav, prompt_text}. The script needs
campplus.onnx (CV2/CV3 speaker encoder) and
speech_tokenizer_v3.onnx (CV3 token extractor); both auto-download
from HF on first run.
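A manifest for the converter could be generated along these lines. Only the three entry keys (name, wav, prompt_text) come from the description above; the top-level JSON list, the file paths and the voice names are assumptions made for illustration, so check them against the converter script before relying on them.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"name", "wav", "prompt_text"}

# Hypothetical entries; replace with your own clips and exact transcripts.
EXAMPLE_ENTRIES = [
    {"name": "my-voice-en",
     "wav": "prompts/my_voice_en.wav",
     "prompt_text": "Exact transcript of the English prompt clip."},
    {"name": "my-voice-de",
     "wav": "prompts/my_voice_de.wav",
     "prompt_text": "Exakte Transkription des deutschen Prompt-Clips."},
]


def write_manifest(entries: list, path: str) -> None:
    """Validate entries and write them as JSON (ASSUMPTION: top-level list)."""
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"manifest entry {entry!r} lacks {sorted(missing)}")
    Path(path).write_text(
        json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8")


if __name__ == "__main__":
    write_manifest(EXAMPLE_ENTRIES, "my-voices.json")
```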
Arbitrary-WAV cloning (native, no Python pre-bake)
With the cosyvoice3-s3tok-f16.gguf + cosyvoice3-campplus-f16.gguf
companions present (siblings of the LLM, or pulled by -m auto), you
can clone from any 16 kHz WAV at runtime:
crispasr -m cosyvoice3-llm-q4_k.gguf \
--backend cosyvoice3-tts \
--voice my_reference.wav \
--ref-text "exact transcription of my_reference.wav" \
--tts "The text to speak in the cloned voice." \
--tts-output out.wav
The runtime ports all three front-end extractors to ggml: the
speech_tokenizer_v3 token extractor (12 FSMN/attention blocks +
FSQ head; byte-exact vs the ONNX reference, validated stage-by-stage
with crispasr-diff), the CAMPPlus 192-D speaker encoder, and the
matcha 24 kHz reference mel. The legacy Python pre-bake bridge
(convert-cosyvoice3-voices-to-gguf.py) remains as an automatic
fallback when the companions are absent.
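Since runtime cloning expects a 16 kHz reference, it can help to check a clip before passing it to --voice. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV only; `check_reference_wav` is an illustrative helper, not part of CrispASR.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # reference sample rate stated above


def check_reference_wav(path: str) -> dict:
    """Report sample rate, channel count and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        info = {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / rate,
        }
    info["ok"] = info["sample_rate"] == EXPECTED_RATE_HZ
    return info


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(check_reference_wav(sys.argv[1]))
```

If `ok` is false, resample the clip to 16 kHz with the audio tool of your choice before cloning.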
Tensor naming
Conventional naming for all three GGUFs:
- LLM: llama.cpp-standard `token_embd`, `blk.K.{attn,ffn}_*`, `output_norm`, `output`, plus CV3-specific `cosyvoice3.speech_embd.weight` (input embedding, vocab 6761) and `cosyvoice3.speech_lm_head.weight` (output head).
- Flow: `cosyvoice3.flow.{input_embd,pre_la,spk_affine,dit.*}` matching the upstream `CausalMaskedDiffWithDiT` module tree.
- HiFT: `cosyvoice3.hift.{conv_pre,ups.K,resblocks.K.*,source_*, m_source,f0.*,conv_post}` with weight-norm pre-resolved on the Python converter side (g · v / ‖v‖).
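For readers unfamiliar with weight normalisation, the g · v / ‖v‖ step mentioned for HiFT can be sketched in plain Python. This is not the converter's code: it assumes the norm is taken per output channel over all remaining dimensions (the default of PyTorch's weight_norm), with each channel of `v` already flattened to a list.

```python
import math


def resolve_weight_norm(g: list, v: list) -> list:
    """Fold gain g and direction v into plain weights: w = g * v / ||v||.

    g: one gain per output channel.
    v: one flattened direction vector per output channel.
    ASSUMPTION: the L2 norm is computed per output channel.
    """
    resolved = []
    for gain, row in zip(g, v):
        norm = math.sqrt(sum(x * x for x in row))  # ||v|| for this channel
        resolved.append([gain * x / norm for x in row])
    return resolved


if __name__ == "__main__":
    # One channel with v = (3, 4), so ||v|| = 5, and gain g = 2.
    print(resolve_weight_norm([2.0], [[3.0, 4.0]]))
```

Folding the two tensors into one at conversion time means the runtime only ever sees ordinary convolution weights.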
License
The model weights are Apache-2.0 (inherited from the upstream
model). Free for commercial use. The zero_shot voice prompt is the
asset/zero_shot_prompt.wav clip from the Apache-2.0 CosyVoice repo.
The fleurs-{en,de,zh,ja,fr,es,ko} voice prompts are derived (trimmed +
loudness-normalised) from Google's FLEURS corpus, licensed CC BY 4.0
(commercial use permitted, attribution required):
FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), Conneau et al., 2022, https://huggingface.co/datasets/google/fleurs, licensed CC BY 4.0. The prompt clips here are trimmed excerpts, loudness-normalised; no other modification.
All eight baked voices are therefore clean for commercial use under permissive licenses (Apache-2.0 / CC BY 4.0).
Related links
- Upstream: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
- Project page: funaudiollm.github.io/cosyvoice3
- Code: github.com/FunAudioLLM/CosyVoice
- CrispASR: github.com/CrispStrobe/CrispASR