ZipVoice.AXERA

On-device inference demo of ZipVoice for AXERA boards.

Features

  • Supports Chinese and English speech generation.
  • Supports voice cloning.
  • Supports ZipVoice and ZipVoice Distill.

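Model notes

ZipVoice Distill is the distilled version of ZipVoice; its main advantage is faster inference at a small cost in quality. In preliminary tests on AX650, ZipVoice Distill is about 3x faster than the base model on long text, with an RTF of about 0.3 and no noticeable drop in quality.

The AX630C version currently produces poor inference results, with an RTF of about 1.5, and needs further tuning.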
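
As a quick illustration of how the RTF and speedup figures relate, the sketch below recomputes them from the Chinese paragraph timings reported in the inference section of this README. It assumes RTF is inference time divided by generated audio duration, which matches the reported numbers; the function name `rtf` is only illustrative.

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: inference time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Timings of the Chinese paragraph example on AX650, taken from this README.
base = rtf(40.292, 44.744)     # ZipVoice
distill = rtf(13.457, 44.744)  # ZipVoice Distill
speedup = base / distill       # how many times faster the distilled model is

print(f"base RTF={base:.4f}, distill RTF={distill:.4f}, speedup={speedup:.2f}x")
```

An RTF below 1 means the audio is generated faster than it plays back.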

Model conversion

Model quantization reference:

Supported platforms

Directory structure

ZipVoice.AXERA/
├── assets/
│   ├── moss_prompts/
│   └── paragraphs/
├── models/
│   ├── zipvoice_ax650/
│   ├── zipvoice_distill_ax650/
│   └── zipvoice_distill_ax630C/
├── resources/
│   ├── vocos-mel-24khz/
│   └── zipvoice_hf/
├── scripts/
├── infer_zipvoice_axera.py
├── requirements.txt
└── README.md
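
Before running inference it can help to confirm that the checkout matches the tree above. The following is a minimal sketch that only checks whether the listed paths exist; the helper name `missing_paths` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directories and files listed in the tree above, relative to the repo root.
EXPECTED = [
    "assets/moss_prompts",
    "assets/paragraphs",
    "models/zipvoice_ax650",
    "models/zipvoice_distill_ax650",
    "models/zipvoice_distill_ax630C",
    "resources/vocos-mel-24khz",
    "resources/zipvoice_hf",
    "scripts",
    "infer_zipvoice_axera.py",
    "requirements.txt",
]


def missing_paths(repo_root: str) -> list[str]:
    """Return the expected entries that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    print("all expected paths found" if not missing else f"missing: {missing}")
```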

Environment

Install pyaxengine:

pip3 install axengine-x.x.x-py3-none-any.whl

Install dependencies:

conda create -n ZipVoice python=3.10
conda activate ZipVoice
pip3 install -r requirements.txt
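
After installation, a short check like the one below can confirm that the runtime is visible in the active environment. The module name `axengine` is an assumption inferred from the wheel file name above; adjust it if the package exposes a different import name.

```python
import importlib.util
import sys


def check_environment(module_name: str = "axengine") -> bool:
    """Return True if module_name can be found in the active environment."""
    # "axengine" is an assumed import name, inferred from the wheel file name.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    print("axengine importable:", check_environment())
```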

Inference commands

Enter the directory:

cd ZipVoice.AXERA

AX650 ZipVoice

Chinese sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text "今天午后天气很好,我打开窗户,听见远处有人聊天,水杯也轻轻晃了一下。" \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_sentence_ax650.wav \
  --seed 42

Inference result:

Inference time: 5.781s
Generated audio duration: 6.411s
RTF: 0.9018

Audio: outputs/zh_sentence_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text "This morning, a small train left the station, carrying sleepy passengers toward a bright coastal town." \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_sentence_ax650.wav \
  --seed 42

Inference result:

Inference time: 5.711s
Generated audio duration: 6.411s
RTF: 0.8909

Audio: outputs/en_sentence_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav

Chinese paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text-file assets/paragraphs/zh_ginkgo.txt \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_long_paragraph_ax650.wav \
  --seed 42

Inference result:

Inference time: 40.292s
Generated audio duration: 44.744s
RTF: 0.9005

Audio: outputs/zh_long_paragraph_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text-file assets/paragraphs/en_scavenger.txt \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_long_paragraph_ax650.wav \
  --seed 42

Inference result:

Inference time: 62.161s
Generated audio duration: 64.749s
RTF: 0.9600

Audio: outputs/en_long_paragraph_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav

AX650 ZipVoice Distill

Chinese sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text "今天午后天气很好,我打开窗户,听见远处有人聊天,水杯也轻轻晃了一下。" \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_sentence_distill_ax650.wav \
  --seed 42

Inference result:

Inference time: 1.992s
Generated audio duration: 6.411s
RTF: 0.3107

Audio: outputs/zh_sentence_distill_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text "This morning, a small train left the station, carrying sleepy passengers toward a bright coastal town." \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_sentence_distill_ax650.wav \
  --seed 42

Inference result:

Inference time: 2.045s
Generated audio duration: 6.411s
RTF: 0.3189

Audio: outputs/en_sentence_distill_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav

Chinese paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text-file assets/paragraphs/zh_ginkgo.txt \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_long_paragraph_distill_ax650.wav \
  --seed 42

Inference result:

Inference time: 13.457s
Generated audio duration: 44.744s
RTF: 0.3008

Audio: outputs/zh_long_paragraph_distill_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text-file assets/paragraphs/en_scavenger.txt \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_long_paragraph_distill_ax650.wav \
  --seed 42

Inference result:

Inference time: 19.715s
Generated audio duration: 64.749s
RTF: 0.3045

Audio: outputs/en_long_paragraph_distill_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav
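
The commands above differ only in model name, input text, and output path, so they can be generated from a small helper when comparing models. This is a sketch, not part of the repository: `build_command` is a hypothetical name, and requiring exactly one of `--text` or `--text-file` is an assumption of this helper rather than documented behavior of the script.

```python
import shlex

SCRIPT = "infer_zipvoice_axera.py"
MODEL_NAMES = ("zipvoice_ax650", "zipvoice_distill_ax650", "zipvoice_distill_ax630C")


def build_command(model_name, prompt_text, prompt_wav, output_wav,
                  text=None, text_file=None, seed=42):
    """Build the argument list for one inference run (nothing is executed)."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {model_name}")
    # Assumption of this helper: pass either inline text or a text file, not both.
    if (text is None) == (text_file is None):
        raise ValueError("pass exactly one of text or text_file")
    cmd = ["python3", SCRIPT, "--model-name", model_name]
    cmd += ["--text", text] if text is not None else ["--text-file", text_file]
    cmd += ["--prompt-text", prompt_text,
            "--prompt-wav", prompt_wav,
            "--output-wav", output_wav,
            "--seed", str(seed)]
    return cmd


if __name__ == "__main__":
    # Print the Chinese paragraph command for both AX650 models.
    for model in ("zipvoice_ax650", "zipvoice_distill_ax650"):
        suffix = model.removeprefix("zipvoice_")
        cmd = build_command(
            model,
            prompt_text="不管怎么样我和汤姆还是要感谢贝尔卡金的援手",
            prompt_wav="assets/moss_prompts/zh_1_4p5s.wav",
            output_wav=f"outputs/zh_long_paragraph_{suffix}.wav",
            text_file="assets/paragraphs/zh_ginkgo.txt",
        )
        print(shlex.join(cmd))
```

To actually run a command on the board, the returned list can be passed to `subprocess.run(cmd, check=True)` from inside the ZipVoice.AXERA directory.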

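Parameters

  • --model-name: selects the model directory. Options: zipvoice_ax650, zipvoice_distill_ax650, zipvoice_distill_ax630C.
  • --prompt-wav: reference audio used to control the voice timbre; 3-5 s is recommended.
  • --prompt-text: transcript of the reference audio; it should match the content of prompt-wav as closely as possible.
  • --num-step: number of sampling steps. By default it is read from runtime_config.json in the model directory.
  • --max-feat-len: fixed feature length of the decoder; 1024 for all current models.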
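
Since a 3-5 s reference clip is recommended for --prompt-wav, the length of a candidate file can be checked before inference. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files only; the helper names are hypothetical and the thresholds simply mirror the recommendation above.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: number of frames divided by sample rate."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def prompt_length_ok(path: str, min_s: float = 3.0, max_s: float = 5.0) -> bool:
    """True if the clip length falls within the recommended 3-5 s range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `prompt_length_ok("assets/moss_prompts/zh_1_4p5s.wav")` would report whether that prompt falls inside the recommended range.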
