CosyVoice3 Multilingual Multispeaker TTS Model (v3_ep49)

A multilingual, multispeaker TTS model based on CosyVoice3. It supports four languages (Korean, English, Japanese, and Chinese) and includes voices for two speakers.

📋 Model Information

| Item | Details |
| --- | --- |
| Base Model | CosyVoice3 (Fun-CosyVoice3-0.5B) |
| Training Epochs | 49 epochs |
| Training Date | March 2026 |
| Model Size | 6.6GB |
| Languages | Korean (ko), English (en), Japanese (ja), Chinese (zh) |

👥 Speakers

| Speaker ID | Name | Description |
| --- | --- | --- |
| nalnani | 날나니 | Female Korean speaker; bright, natural voice |
| hwangjunhee | 황준희 | Male Korean speaker; calm, steady voice |

Speaker Details

  • Training data: roughly 1 to 2 hours of high-quality recordings per speaker
  • Supported languages: every speaker supports all four languages (ko/en/ja/zh)
  • Voice control: emotion, speed, and tone can be adjusted through instruct prompts

📁 Model Files

├── llm.pt                    # LLM weights (1.9GB)
├── flow.pt                   # Flow Matching weights (1.3GB)
├── hift.pt                   # HiFiGAN vocoder (80MB)
├── speech_tokenizer_v3.onnx  # Speech tokenizer (925MB)
├── campplus.onnx             # Speaker embedding extractor (27MB)
├── spk2info.pt               # Speaker embeddings (3KB)
├── cosyvoice3.yaml           # Model config
├── CosyVoice-BlankEN/        # Qwen2 tokenizer
└── vllm/                     # vLLM-optimized LLM
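
A quick way to confirm that a download is complete is to compare the local directory against the listing above. The following is a minimal sketch; the helper name `missing_model_entries` is hypothetical and not part of CosyVoice or the server repository, and it only checks that each entry exists, not its size or integrity.

```python
from pathlib import Path

# Entries copied from the file listing above; directories end with "/".
EXPECTED_ENTRIES = [
    "llm.pt",
    "flow.pt",
    "hift.pt",
    "speech_tokenizer_v3.onnx",
    "campplus.onnx",
    "spk2info.pt",
    "cosyvoice3.yaml",
    "CosyVoice-BlankEN/",
    "vllm/",
]


def missing_model_entries(model_dir):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry.rstrip("/")
        # Directory entries must be directories, everything else a regular file.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_model_entries("./models/v3_ep49")` after the download step should return an empty list when every entry is in place.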

🚀 Usage

With CosyVoice3 API Server

# Clone the server repo
git clone https://github.com/GoodGangLabs/cosyvoice3-instruct-multilanguage-multispeaker-server.git

# Download model from HuggingFace
huggingface-cli download gglabs/cosyvoice3-multilingual-multispeaker-v3_ep49 --local-dir ./models/v3_ep49

# Run with Docker
docker-compose up -d

# Test TTS
curl -X POST http://localhost:8090/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "안녕하세요", "language": "ko", "spk_id": "nalnani"}'
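
The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the URL and JSON field names come from the curl example above, while the helper names and the response handling are hypothetical. The server's response format is not documented here, so the sketch simply saves the raw response body.

```python
import json
import urllib.request

# Endpoint and field names are taken from the curl example above.
TTS_URL = "http://localhost:8090/tts"


def build_tts_request(text, language, spk_id, url=TTS_URL):
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {"text": text, "language": language, "spk_id": spk_id}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, language, spk_id, out_path, timeout=60):
    """Send the request and write the raw response body to out_path.

    Assumption: the server answers with audio bytes. If it returns JSON
    instead, decode the body rather than saving it directly.
    """
    request = build_tts_request(text, language, spk_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

For example, `synthesize_to_file("안녕하세요", "ko", "nalnani", "out.wav")` would mirror the curl call, provided the server is running on port 8090.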

Direct Python Usage

from cosyvoice.cli.cosyvoice import CosyVoice3

# Load model
model = CosyVoice3("./models/v3_ep49", load_vllm=True)

# Generate speech
for result in model.inference_instruct2(
    "안녕하세요, 반갑습니다.",  # text to synthesize ("Hello, nice to meet you.")
    "한국어로 자연스럽게 말해주세요.",  # instruct prompt ("Please speak naturally in Korean.")
    "nalnani",  # speaker ID
):
    # result contains audio waveform
    pass
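
To keep the generated audio, the chunks can be written to a WAV file. The sketch below uses only the standard library and works on plain float samples; `write_wav` is a hypothetical helper, not part of CosyVoice. It assumes, as in upstream CosyVoice, that each result is a dict holding a `tts_speech` tensor and that the model exposes `sample_rate`; verify both against the version you are running.

```python
import struct
import wave


def write_wav(chunks, sample_rate, out_path):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    chunks: iterable of float sequences, one sequence per generated chunk.
    Returns the number of samples written.
    """
    total = 0
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clip before scaling so out-of-range samples do not wrap around.
            ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in chunk]
            wav.writeframes(struct.pack("<%dh" % len(ints), *ints))
            total += len(ints)
    return total


# Hypothetical usage (assumes result["tts_speech"] is a tensor and
# model.sample_rate exists, as in upstream CosyVoice):
#
#   chunks = (r["tts_speech"].flatten().tolist()
#             for r in model.inference_instruct2(text, instruct, "nalnani"))
#   write_wav(chunks, model.sample_rate, "out.wav")
```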

⚡ Performance

Performance with vLLM + TensorRT optimization:

| Text Length | Latency |
| --- | --- |
| Short (5 characters) | ~0.5s |
| Medium (20 characters) | ~0.9s |
| Long (50 characters) | ~1.5s |
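
The table does not say whether latency means time to the first audio chunk or time to the full utterance, so a local measurement may want to record both. A minimal timing sketch, assuming the synthesis call is a generator as in the Python usage example; `measure_latency` is a hypothetical helper.

```python
import time


def measure_latency(generate, *args, **kwargs):
    """Time a generator-style synthesis call.

    Returns (seconds until the first chunk, total seconds, chunk count).
    The first value is None if the generator yields nothing.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in generate(*args, **kwargs):
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first, total, count
```

For instance, `measure_latency(model.inference_instruct2, text, instruct, "nalnani")` would time one call; averaging several runs after a warm-up call gives steadier numbers.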

🔧 Training

Training Configuration

  • Base: Fun-CosyVoice3-0.5B (Alibaba)
  • Fine-tuning: SFT (Supervised Fine-Tuning)
  • Epochs: 49
  • Learning Rate: 1e-5
  • Hardware: NVIDIA H100 80GB

Data Preparation

  • Collection and transcription of Korean speech data
  • Speech-text alignment (forced alignment)
  • Noise removal and normalization
  • Multilingual extension (translation + cross-lingual transfer)

📜 License

Apache 2.0

🔗 Links

📧 Contact

GoodGang Labs - https://goodganglabs.com
