Step-Audio-TTS-3B

Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a variety of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.

This repository provides the model weights for Step-Audio-TTS-3B, an LLM (Large Language Model) trained with a dual-codebook approach for text-to-speech synthesis. It also includes a vocoder trained with the dual-codebook approach and a specialized vocoder optimized for humming generation.
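As a minimal sketch of fetching the weights described above, the snippet below uses `snapshot_download` from the `huggingface_hub` package. It assumes the weights are published under the Hugging Face repo ID `stepfun-ai/Step-Audio-TTS-3B` and that `huggingface_hub` is installed; the helper name `download_weights` and the default target directory are hypothetical. Running inference requires the code in the Step-Audio repository and is not shown here.

```python
# Assumed repo ID for the weights on the Hugging Face Hub.
REPO_ID = "stepfun-ai/Step-Audio-TTS-3B"


def download_weights(local_dir: str = "Step-Audio-TTS-3B") -> str:
    """Download every file in the model repo and return the local folder path.

    `download_weights` and the default `local_dir` are illustrative names,
    not part of the Step-Audio codebase.
    """
    # Imported lazily so this module loads even without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    # Requires network access and several GB of free disk space.
    print(download_weights())
```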

Performance comparison of content consistency (CER/WER) among GLM-4-Voice, MinMo, and Step-Audio.

| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |
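
For readers unfamiliar with the metrics, CER and WER are the edit distance between a reference text and a transcript of the synthesized audio, divided by the reference length (counted in characters for CER, in words for WER). The sketch below illustrates that definition only. The helper names are hypothetical, and the ASR transcription step and the benchmark's own text normalization are not reproduced, so it will not regenerate the numbers in the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(ref: str, hyp: str) -> float:
    # Character level; spaces are dropped (an assumed, simplistic normalization).
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def wer(ref: str, hyp: str) -> float:
    # Word level; lowercased and split on whitespace (also an assumption).
    return error_rate(ref.lower().split(), hyp.lower().split())


if __name__ == "__main__":
    print(cer("今天天气很好", "今天天汽很好"))          # one substituted character out of six
    print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of six
```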

Results of TTS Models on SEED Test Sets.

  • Step-Audio-TTS-3B-Single denotes the dual-codebook backbone paired with a single-codebook vocoder.
| Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
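
This card does not define SS. Assuming it denotes speaker similarity, a common way to compute such a score is the cosine similarity between speaker embeddings of the reference audio and the synthesized audio. The sketch below shows only that final step on plain vectors; the embedding extractor is not specified here and is left out, and `cosine_similarity` is an illustrative helper, not the benchmark's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors standing in for speaker embeddings (hypothetical values).
    reference_embedding = [0.2, 0.7, 0.1]
    synthesized_embedding = [0.25, 0.6, 0.2]
    print(cosine_similarity(reference_embedding, synthesized_embedding))
```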

Performance comparison of dual-codebook resynthesis with CosyVoice.

| Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |

More information

For more information, please refer to our repository: Step-Audio.

Model size: 3.53B params (Safetensors, tensor type BF16)