ZipVoice.AXERA

On-board inference demo of ZipVoice for AXERA chips.

Features

Supports Chinese and English speech generation.
Supports voice cloning.
Supports both ZipVoice and ZipVoice Distill.

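Model Notes

ZipVoice Distill is the distilled version of ZipVoice; its main advantage is faster inference at a small cost in quality. In preliminary tests on AX650, ZipVoice Distill is about 3x faster than the base model on long text, with an RTF (real-time factor: inference time divided by generated audio duration) of about 0.3 and no obvious degradation in output quality.

The AX630C version currently produces poor inference results, with an RTF of about 1.5, and needs further tuning.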
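
As a minimal sketch of how the RTF and speedup figures relate, the snippet below recomputes them from an inference time and a generated audio duration. The function names are illustrative and not part of this repository; the sample numbers are the Chinese paragraph timings reported in the inference section of this README.

```python
def rtf(inference_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def speedup(base_seconds: float, fast_seconds: float) -> float:
    """How many times faster the second run is compared with the first."""
    if fast_seconds <= 0:
        raise ValueError("fast_seconds must be positive")
    return base_seconds / fast_seconds


if __name__ == "__main__":
    # Chinese paragraph on AX650: base model vs. distilled model.
    base_time, distill_time, audio_len = 40.292, 13.457, 44.744
    print(f"base RTF:    {rtf(base_time, audio_len):.4f}")
    print(f"distill RTF: {rtf(distill_time, audio_len):.4f}")
    print(f"speedup:     {speedup(base_time, distill_time):.2f}x")
```

Because the reported times are rounded to milliseconds, recomputed values may differ from the logged RTF in the last digit.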

Model Conversion

Model quantization reference:

Supported Platforms

AX650
- AX650 demo board
- M4N-Dock (AXera-Pi Pro)
- M.2 Accelerator Card

Directory Structure

ZipVoice.AXERA/
├── assets/
│   ├── moss_prompts/
│   └── paragraphs/
├── models/
│   ├── zipvoice_ax650/
│   ├── zipvoice_distill_ax650/
│   └── zipvoice_distill_ax630C/
├── resources/
│   ├── vocos-mel-24khz/
│   └── zipvoice_hf/
├── scripts/
├── infer_zipvoice_axera.py
├── requirements.txt
└── README.md

Environment

Install pyaxengine:

pip3 install axengine-x.x.x-py3-none-any.whl

Install the dependencies:

conda create -n ZipVoice python=3.10
conda activate ZipVoice
pip3 install -r requirements.txt
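
After installation, a quick sanity check can confirm that the runtime is visible to the active interpreter. This is a small sketch, assuming the pyaxengine wheel installs a top-level module named `axengine` (inferred from the wheel file name, not stated in this README). It only checks that the module can be located and does not load any model.

```python
import importlib.util
import sys


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("python:", sys.version.split()[0])
    # "axengine" is an assumed module name, taken from the wheel file name.
    print("axengine importable:", has_module("axengine"))
```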

Inference Commands

Change into the directory:

cd ZipVoice.AXERA

AX650 ZipVoice

Chinese sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text "今天午后天气很好，我打开窗户，听见远处有人聊天，水杯也轻轻晃了一下。" \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_sentence_ax650.wav \
  --seed 42

Result:

Inference time: 5.781s
Generated audio duration: 6.411s
RTF: 0.9018

Audio: outputs/zh_sentence_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text "This morning, a small train left the station, carrying sleepy passengers toward a bright coastal town." \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_sentence_ax650.wav \
  --seed 42

Result:

Inference time: 5.711s
Generated audio duration: 6.411s
RTF: 0.8909

Audio: outputs/en_sentence_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav

Chinese paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text-file assets/paragraphs/zh_ginkgo.txt \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_long_paragraph_ax650.wav \
  --seed 42

Result:

Inference time: 40.292s
Generated audio duration: 44.744s
RTF: 0.9005

Audio: outputs/zh_long_paragraph_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_ax650 \
  --text-file assets/paragraphs/en_scavenger.txt \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_long_paragraph_ax650.wav \
  --seed 42

Result:

Inference time: 62.161s
Generated audio duration: 64.749s
RTF: 0.9600

Audio: outputs/en_long_paragraph_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav

AX650 ZipVoice Distill

Chinese sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text "今天午后天气很好，我打开窗户，听见远处有人聊天，水杯也轻轻晃了一下。" \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_sentence_distill_ax650.wav \
  --seed 42

Result:

Inference time: 1.992s
Generated audio duration: 6.411s
RTF: 0.3107

Audio: outputs/zh_sentence_distill_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English sentence:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text "This morning, a small train left the station, carrying sleepy passengers toward a bright coastal town." \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_sentence_distill_ax650.wav \
  --seed 42

Result:

Inference time: 2.045s
Generated audio duration: 6.411s
RTF: 0.3189

Audio: outputs/en_sentence_distill_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav

Chinese paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text-file assets/paragraphs/zh_ginkgo.txt \
  --prompt-text "不管怎么样我和汤姆还是要感谢贝尔卡金的援手" \
  --prompt-wav assets/moss_prompts/zh_1_4p5s.wav \
  --output-wav outputs/zh_long_paragraph_distill_ax650.wav \
  --seed 42

Result:

Inference time: 13.457s
Generated audio duration: 44.744s
RTF: 0.3008

Audio: outputs/zh_long_paragraph_distill_ax650.wav
Prompt audio: assets/moss_prompts/zh_1_4p5s.wav

English paragraph:

python3 infer_zipvoice_axera.py \
  --model-name zipvoice_distill_ax650 \
  --text-file assets/paragraphs/en_scavenger.txt \
  --prompt-text "This is almost twice the current industry production level per train." \
  --prompt-wav assets/moss_prompts/en_4_4p5s.wav \
  --output-wav outputs/en_long_paragraph_distill_ax650.wav \
  --seed 42

Result:

Inference time: 19.715s
Generated audio duration: 64.749s
RTF: 0.3045

Audio: outputs/en_long_paragraph_distill_ax650.wav
Prompt audio: assets/moss_prompts/en_4_4p5s.wav
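
The commands above differ only in model name, text source, and output path, so they can be generated from a small helper. The sketch below is illustrative: `build_command` is a hypothetical helper, not part of this repository, and it only uses the flags that appear in the commands above. It prints the commands rather than running them.

```python
import shlex

SCRIPT = "infer_zipvoice_axera.py"


def build_command(model_name, prompt_text, prompt_wav, output_wav,
                  text=None, text_file=None, seed=42):
    """Build the argv list for one inference run (hypothetical helper)."""
    # Exactly one text source is expected: inline text or a text file.
    if (text is None) == (text_file is None):
        raise ValueError("pass exactly one of text or text_file")
    cmd = ["python3", SCRIPT, "--model-name", model_name]
    if text is not None:
        cmd += ["--text", text]
    else:
        cmd += ["--text-file", text_file]
    cmd += [
        "--prompt-text", prompt_text,
        "--prompt-wav", prompt_wav,
        "--output-wav", output_wav,
        "--seed", str(seed),
    ]
    return cmd


if __name__ == "__main__":
    prompt = "This is almost twice the current industry production level per train."
    for model in ("zipvoice_ax650", "zipvoice_distill_ax650"):
        # Output names follow the pattern used in the commands above.
        suffix = model.removeprefix("zipvoice_")
        cmd = build_command(
            model_name=model,
            prompt_text=prompt,
            prompt_wav="assets/moss_prompts/en_4_4p5s.wav",
            output_wav=f"outputs/en_long_paragraph_{suffix}.wav",
            text_file="assets/paragraphs/en_scavenger.txt",
        )
        print(shlex.join(cmd))
```

To execute a command instead of printing it, the returned list can be passed to `subprocess.run(cmd, check=True)` from the ZipVoice.AXERA directory.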

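Parameters

--model-name: selects the model directory. One of zipvoice_ax650, zipvoice_distill_ax650, zipvoice_distill_ax630C.
--prompt-wav: reference audio that controls the voice timbre; 3-5 s is recommended.
--prompt-text: transcript of the reference audio; it should match the content of prompt-wav as closely as possible.
--num-step: number of sampling steps. By default it is read from runtime_config.json in the model directory.
--max-feat-len: fixed feature length of the decoder; 1024 for all current models.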
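
A fixed decoder length implies a rough upper bound on how much audio fits in one decoder pass, prompt included. The sketch below estimates that budget under two assumptions that this README does not state: that the feature length counts mel frames, and that frames use a 24 kHz sample rate (suggested by the vocos-mel-24khz resource name) with a hop of 256 samples. Treat the numbers as an estimate, not as the script's actual behavior.

```python
SAMPLE_RATE = 24_000   # assumption, from the vocos-mel-24khz resource name
HOP_LENGTH = 256       # assumption, not stated in this README
MAX_FEAT_LEN = 1024    # from the --max-feat-len description


def frames_to_seconds(num_frames: int) -> float:
    """Duration covered by a number of feature frames."""
    return num_frames * HOP_LENGTH / SAMPLE_RATE


def generation_budget_seconds(prompt_seconds: float,
                              max_feat_len: int = MAX_FEAT_LEN) -> float:
    """Audio left for generated speech after the prompt is accounted for."""
    return max(0.0, frames_to_seconds(max_feat_len) - prompt_seconds)


if __name__ == "__main__":
    print(f"total window: {frames_to_seconds(MAX_FEAT_LEN):.3f}s")
    print(f"budget after a 4.5s prompt: {generation_budget_seconds(4.5):.3f}s")
```

Under these assumptions, a longer prompt leaves less room for generated speech in each pass, which is one reason to keep the prompt within the recommended 3-5 s.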

References

ZipVoice
