CosyVoice3 Multilingual Multispeaker TTS Model (v3_ep49)
A multilingual, multi-speaker TTS model based on CosyVoice3. It supports four languages (Korean, English, Japanese, and Chinese) and includes voices for two speakers.
Model Information
| Item | Details |
|---|---|
| Base Model | CosyVoice3 (Fun-CosyVoice3-0.5B) |
| Training Epochs | 49 epochs |
| Training Date | March 2026 |
| Model Size | 6.6GB |
| Languages | Korean (ko), English (en), Japanese (ja), Chinese (zh) |
Speakers
| Speaker ID | Name | Description |
|---|---|---|
| nalnani | 날나니 | Female Korean speaker; bright, natural voice |
| hwangjunhee | 황준희 | Male Korean speaker; calm, stable voice |
Speaker Details
- Training data: roughly 1 to 2 hours of high-quality recordings per speaker
- Supported languages: every speaker supports all four languages (ko/en/ja/zh)
- Voice characteristics: emotion, speed, and tone can be adjusted through instruct prompts
Model Files
├── llm.pt                    # LLM weights (1.9GB)
├── flow.pt                   # Flow Matching weights (1.3GB)
├── hift.pt                   # HiFiGAN vocoder (80MB)
├── speech_tokenizer_v3.onnx  # Speech tokenizer (925MB)
├── campplus.onnx             # Speaker embedding extractor (27MB)
├── spk2info.pt               # Speaker embeddings (3KB)
├── cosyvoice3.yaml           # Model config
├── CosyVoice-BlankEN/        # Qwen2 tokenizer
└── vllm/                     # vLLM-optimized LLM
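After downloading, it can help to confirm that the local directory matches the listing above before starting a server. The following is a minimal sketch, not part of the model repository: `missing_entries` is a hypothetical helper, and it only checks that the names exist, not their sizes or contents.

```python
from pathlib import Path

# Names copied from the file listing above; sizes are not verified.
EXPECTED_FILES = [
    "llm.pt",
    "flow.pt",
    "hift.pt",
    "speech_tokenizer_v3.onnx",
    "campplus.onnx",
    "spk2info.pt",
    "cosyvoice3.yaml",
]
EXPECTED_DIRS = ["CosyVoice-BlankEN", "vllm"]


def missing_entries(model_dir):
    """Return the expected files and directories absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

For example, `missing_entries("./models/v3_ep49")` returns an empty list when every listed entry is present in that directory.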
Usage
With CosyVoice3 API Server
# Clone the server repo
git clone https://github.com/GoodGangLabs/cosyvoice3-instruct-multilanguage-multispeaker-server.git
# Download model from HuggingFace
huggingface-cli download gglabs/cosyvoice3-multilingual-multispeaker-v3_ep49 --local-dir ./models/v3_ep49
# Run with Docker
docker-compose up -d
# Test TTS
curl -X POST http://localhost:8090/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "안녕하세요", "language": "ko", "spk_id": "nalnani"}'
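The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the URL and the JSON fields (`text`, `language`, `spk_id`) come from the curl command above, while `build_tts_request` and `synthesize` are hypothetical helper names, and the response is assumed to be raw audio bytes, which this model card does not confirm.

```python
import json
import urllib.request

# Endpoint and JSON fields are taken from the curl example above.
TTS_URL = "http://localhost:8090/tts"


def build_tts_request(text, language, spk_id, url=TTS_URL):
    """Build the same POST request that the curl command sends."""
    payload = {"text": text, "language": language, "spk_id": spk_id}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, language, spk_id, out_path, timeout=60):
    """Send the request and write the raw response body to out_path.

    Assumption: the server replies with audio bytes. If it returns JSON
    instead, inspect the body before treating it as audio.
    """
    request = build_tts_request(text, language, spk_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

With the Docker container running, a call such as `synthesize("안녕하세요", "ko", "nalnani", "out.wav")` would write whatever the server returns to `out.wav`.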
Direct Python Usage
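from cosyvoice.cli.cosyvoice import CosyVoice3

# Load the model from the downloaded directory, using the vLLM-optimized LLM
model = CosyVoice3("./models/v3_ep49", load_vllm=True)

# Generate speech: text to speak, instruct prompt, speaker ID
for result in model.inference_instruct2(
    "안녕하세요, 반갑습니다.",  # "Hello, nice to meet you."
    "한국어로 자연스럽게 말해주세요.",  # "Please speak naturally in Korean."
    "nalnani",
):
    # result contains the audio waveform
    pass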
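The loop above discards each result. One way to keep the audio is to collect the chunks and write them to a WAV file. The sketch below uses only the standard library; `save_chunks` is a hypothetical helper. It assumes each chunk is a sequence of float samples in [-1.0, 1.0]. How to get those samples from `result`, and which sample rate to use, depends on the CosyVoice version and is not specified in this model card.

```python
import struct
import wave


def save_chunks(chunks, out_path, sample_rate):
    """Concatenate float sample chunks and write a mono 16-bit PCM WAV.

    chunks: iterable of sequences of floats, nominally in [-1.0, 1.0].
    Returns the number of samples written.
    """
    total = 0
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1], then scale to the int16 range.
            ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in chunk]
            wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
            total += len(ints)
    return total


# Assumed usage, following upstream CosyVoice conventions (unverified here):
# each result is a dict with a "tts_speech" tensor, and the model object
# exposes a sample_rate attribute.
#
# chunks = (r["tts_speech"].flatten().tolist() for r in results)
# save_chunks(chunks, "out.wav", model.sample_rate)
```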
Performance
Performance with vLLM + TensorRT optimization:
| Text Length | Latency |
|---|---|
| Short (5 characters) | ~0.5s |
| Medium (20 characters) | ~0.9s |
| Long (50 characters) | ~1.5s |
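To compare these figures with a local setup, latency can be measured around any generator-style inference call. The table does not say whether it reports time to first audio chunk or total generation time, so this sketch records both. `time_generation` is a hypothetical helper, not part of CosyVoice.

```python
import time


def time_generation(generate, *args, **kwargs):
    """Drain a generator-returning callable and time it.

    Returns (seconds_to_first_chunk, total_seconds, chunk_count).
    seconds_to_first_chunk is None if nothing was yielded.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in generate(*args, **kwargs):
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first, total, count
```

For instance, `time_generation(model.inference_instruct2, text, instruct, "nalnani")` would time the call shown under Direct Python Usage. Running it several times and ignoring the first warm-up run gives more stable numbers.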
Training
Training Configuration
- Base: Fun-CosyVoice3-0.5B (Alibaba)
- Fine-tuning: SFT (Supervised Fine-Tuning)
- Epochs: 49
- Learning Rate: 1e-5
- Hardware: NVIDIA H100 80GB
Data Preparation
- Collection and transcription of Korean speech data
- Speech-to-text alignment (forced alignment)
- Noise removal and normalization
- Multilingual extension (translation + cross-lingual transfer)
License
Apache 2.0
Links
- API Server: https://github.com/GoodGangLabs/cosyvoice3-instruct-multilanguage-multispeaker-server
- Base Model: CosyVoice
- Organization: GoodGang Labs
Contact
GoodGang Labs - https://goodganglabs.com