Blane187 committed on
Commit 06ab6dd · 1 parent: adbdefc

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,52 +1,213 @@
- ---
- license: mit
- ---
-
- # easyGUI
-
- `easyGUI` is a user-friendly voice conversion framework based on VITS, designed to eliminate timbre leakage by replacing input features with those from the training set. It's efficient even on lower-end GPUs, requiring only about 10 minutes of low-noise speech data for good results. The framework features a simple web interface, supports A card and I card acceleration, and uses the advanced RMVPE algorithm for pitch extraction.
-
- ## Installation
-
- ### Prerequisites
- - Python 3.8 or higher
-
- ### Installation Steps
- 1. **Install Pytorch**:
- ```bash
- pip install torch torchvision torchaudio
- ```
-
- 2. **Install Dependencies**:
- ```bash
- pip install -r requirements.txt
- ```
-
- 3
-
- ### Additional Setup
- - **Download Assets**:
-   Download necessary models and files using the scripts in the `tools` directory.
- - **Install FFmpeg**:
- ```bash
- sudo apt install ffmpeg
- ```
-
- ## Usage
- Start the WebUI:
- ```bash
- python demo.py
- ```
-
- ## Features
- - Top1 retrieval to replace input features
- - Fast training on less powerful GPUs
- - Model merging to change timbre
- - Advanced pitch extraction with RMVPE
-
- ---
+ <div align="center">
+
+ <h1>Retrieval-based-Voice-Conversion-WebUI</h1>
+ An easy-to-use voice conversion framework based on VITS<br><br>
+
+ [![madewithlove](https://img.shields.io/badge/made_with-%E2%9D%A4-red?style=for-the-badge&labelColor=orange)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI)
+
+ <img src="https://counter.seku.su/cmoe?name=rvc&theme=r34" /><br>
+
+ [![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/Retrieval_based_Voice_Conversion_WebUI.ipynb)
+ [![Licence](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/LICENSE)
+ [![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-yellow.svg?style=for-the-badge)](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/)
+
+ [![Discord](https://img.shields.io/badge/RVC%20Developers-Discord-7289DA?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/HcsmBBGyVk)
+
+ [**Changelog**](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/Changelog_CN.md) | [**FAQ**](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/wiki/%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98%E8%A7%A3%E7%AD%94) | [**Train an AI singer on AutoDL for ~0.5 RMB**](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/wiki/Autodl%E8%AE%AD%E7%BB%83RVC%C2%B7AI%E6%AD%8C%E6%89%8B%E6%95%99%E7%A8%8B) | [**Comparative experiment records**](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/wiki/%E5%AF%B9%E7%85%A7%E5%AE%9E%E9%AA%8C%C2%B7%E5%AE%9E%E9%AA%8C%E8%AE%B0%E5%BD%95) | [**Online demo**](https://modelscope.cn/studios/FlowerCry/RVCv2demo)
+
+ [**English**](./docs/en/README.en.md) | [**中文简体**](./README.md) | [**日本語**](./docs/jp/README.ja.md) | [**한국어**](./docs/kr/README.ko.md) ([**韓國語**](./docs/kr/README.ko.han.md)) | [**Français**](./docs/fr/README.fr.md) | [**Türkçe**](./docs/tr/README.tr.md) | [**Português**](./docs/pt/README.pt.md)
+
+ </div>
+
+ > The base model is trained on nearly 50 hours of the open-source, high-quality VCTK dataset, so there are no copyright concerns; please feel free to use it.
+
+ > Look forward to the RVCv3 base model: larger parameters, more data, better results, roughly the same inference speed, and less training data required.
+
+ <table>
+ <tr>
+ <td align="center">Training & inference UI</td>
+ <td align="center">Real-time voice-changing UI</td>
+ </tr>
+ <tr>
+ <td align="center"><img src="https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/assets/129054828/092e5c12-0d49-4168-a590-0b0ef6a4f630"></td>
+ <td align="center"><img src="https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/assets/129054828/730b4114-8805-44a1-ab1a-04668f3c30a6"></td>
+ </tr>
+ <tr>
+ <td align="center">go-web.bat</td>
+ <td align="center">go-realtime-gui.bat</td>
+ </tr>
+ <tr>
+ <td align="center">Freely choose the operation you want to perform.</td>
+ <td align="center">We have achieved 170 ms end-to-end latency. With ASIO input/output devices, end-to-end latency as low as 90 ms is possible, though it depends heavily on hardware driver support.</td>
+ </tr>
+ </table>
+
+ ## Introduction
+ This repository has the following features:
+ + Top-1 retrieval replaces the input-source features with training-set features to eliminate timbre leakage
+ + Fast training even on relatively weak GPUs
+ + Good results from small amounts of training data (collecting at least 10 minutes of low-noise speech is recommended)
+ + Timbre can be changed by model merging (via ckpt-merge in the ckpt-processing tab)
+ + Simple, easy-to-use web interface
+ + The UVR5 model can be invoked to quickly separate vocals and accompaniment
+ + Uses the state-of-the-art [InterSpeech2023-RMVPE vocal pitch extraction algorithm](#credits) to eradicate the mute-tone problem; gives the best results (significantly) while being faster and lighter than crepe_full
+ + AMD GPU and Intel GPU acceleration support
+
+ Click here to watch our [demo video](https://www.bilibili.com/video/BV1pm4y1z7Gm/)!
+
+ ## Environment setup
+ The following commands must be run in an environment with Python 3.8 or higher.
+
+ ### Universal method for Windows/Linux/MacOS and other platforms
+ Choose any one of the following methods.
+ #### 1. Install dependencies via pip
+ 1. Install PyTorch and its core dependencies; skip if already installed. Reference: https://pytorch.org/get-started/locally/
+ ```bash
+ pip install torch torchvision torchaudio
+ ```
+ 2. On Windows with an Nvidia Ampere architecture GPU (RTX 30xx), per the experience reported in #21, you need to specify the CUDA version matching PyTorch:
+ ```bash
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
+ ```
+ 3. Install the dependencies matching your GPU:
+ - Nvidia GPUs
+ ```bash
+ pip install -r requirements.txt
+ ```
+ - AMD/Intel GPUs
+ ```bash
+ pip install -r requirements-dml.txt
+ ```
+ - AMD GPUs with ROCm (Linux)
+ ```bash
+ pip install -r requirements-amd.txt
+ ```
+ - Intel GPUs with IPEX (Linux)
+ ```bash
+ pip install -r requirements-ipex.txt
+ ```
+
+ #### 2. Install dependencies via Poetry
+ Install the Poetry dependency manager; skip if already installed. Reference: https://python-poetry.org/docs/#installation
+ ```bash
+ curl -sSL https://install.python-poetry.org | python3 -
+ ```
+
+ When installing dependencies via Poetry, Python 3.7-3.10 is recommended; other versions will conflict when installing llvmlite==0.39.0.
+ ```bash
+ poetry init -n
+ poetry env use "path to your python.exe"
+ poetry run pip install -r requirements.txt
+ ```
+
+ ### MacOS
+ Dependencies can be installed via `run.sh`:
+ ```bash
+ sh ./run.sh
+ ```
+
+ ## Preparing other pretrained models
+ RVC requires some other pretrained models for inference and training.
+
+ You can download these models from our [Hugging Face space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/).
+
+ ### 1. Download assets
+ Below is a checklist of all the pretrained models and other files RVC requires. Scripts to download them can be found in the `tools` folder.
+
+ - ./assets/hubert/hubert_base.pt
+
+ - ./assets/pretrained
+
+ - ./assets/uvr5_weights
+
+ To use v2 models, you additionally need to download
+
+ - ./assets/pretrained_v2
+
+ ### 2. Install ffmpeg
+ Skip if ffmpeg and ffprobe are already installed.
+
+ #### Ubuntu/Debian users
+ ```bash
+ sudo apt install ffmpeg
+ ```
+ #### MacOS users
+ ```bash
+ brew install ffmpeg
+ ```
+ #### Windows users
+ Download the files below and place them in the root directory.
+ - Download [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
+
+ - Download [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
+
+ ### 3. Download the files required by the rmvpe vocal pitch extraction algorithm
+
+ If you want to use the latest RMVPE vocal pitch extraction algorithm, download the pitch extraction model weights and place them in the RVC root directory.
+
+ - Download [rmvpe.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt)
+
+ #### Download the DML environment for rmvpe (optional, for AMD/Intel GPU users)
+
+ - Download [rmvpe.onnx](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.onnx)
+
+ ### 4. AMD GPU ROCm (optional, Linux only)
+
+ If you want to run RVC on Linux using AMD's ROCm technology, first install the required drivers [here](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html).
+
+ If you use Arch Linux, you can install the required drivers with pacman:
+ ```bash
+ pacman -S rocm-hip-sdk rocm-opencl-sdk
+ ```
+ For some GPU models you may also need to set the following environment variables (e.g., for the RX6700XT):
+ ```bash
+ export ROCM_PATH=/opt/rocm
+ export HSA_OVERRIDE_GFX_VERSION=10.3.0
+ ```
+ Also make sure your current user is in the `render` and `video` groups:
+ ```bash
+ sudo usermod -aG render $USERNAME
+ sudo usermod -aG video $USERNAME
+ ```
+
+ ## Getting started
+ ### Direct launch
+ Start the WebUI with the following command:
+ ```bash
+ python infer-web.py
+ ```
+
+ If you previously installed dependencies with Poetry, you can start the WebUI like this:
+ ```bash
+ poetry run python infer-web.py
+ ```
+
+ ### Using the integration package
+ Download and extract `RVC-beta.7z`
+ #### Windows users
+ Double-click `go-web.bat`
+ #### MacOS users
+ ```bash
+ sh ./run.sh
+ ```
+ ### For Intel GPU users who need IPEX (Linux only)
+ ```bash
+ source /opt/intel/oneapi/setvars.sh
+ ```
+
+ ## Credits
+ + [ContentVec](https://github.com/auspicious3000/contentvec/)
+ + [VITS](https://github.com/jaywalnut310/vits)
+ + [HIFIGAN](https://github.com/jik876/hifi-gan)
+ + [Gradio](https://github.com/gradio-app/gradio)
+ + [FFmpeg](https://github.com/FFmpeg/FFmpeg)
+ + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
+ + [audio-slicer](https://github.com/openvpi/audio-slicer)
+ + [Vocal pitch extraction: RMVPE](https://github.com/Dream-High/RMVPE)
+ + The pretrained model is trained and tested by [yxlllc](https://github.com/yxlllc/RMVPE) and [RVC-Boss](https://github.com/RVC-Boss).
+
+ ## Thanks to all contributors for their efforts
+ <a href="https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/graphs/contributors" target="_blank">
+ <img src="https://contrib.rocks/image?repo=RVC-Project/Retrieval-based-Voice-Conversion-WebUI" />
+ </a>
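
The top-1 retrieval that the README leads with is implemented in `infer/modules/vc/pipeline.py` (further down this page) as a top-8 faiss search whose hits are blended back into the input features. Below is a minimal self-contained sketch of the idea, assuming a prebuilt `IndexFlatL2`; the function name, the 256-dim/50-frame shapes, and the default `index_rate` are illustrative, not values taken from this repository.

```python
import faiss
import numpy as np

def retrieve_and_blend(feats, index, big_npy, index_rate=0.75):
    """Replace each frame with a distance-weighted mix of its k nearest
    training-set features, then blend with the original frame."""
    score, ix = index.search(feats, 8)            # squared L2 distances + ids
    weight = np.square(1 / score)                 # inverse-square-distance weights
    weight /= weight.sum(axis=1, keepdims=True)   # normalize weights per frame
    retrieved = np.sum(big_npy[ix] * np.expand_dims(weight, axis=2), axis=1)
    return index_rate * retrieved + (1 - index_rate) * feats

# Toy usage: 1000 training frames of 256-dim features, 50 query frames.
big_npy = np.random.rand(1000, 256).astype(np.float32)
index = faiss.IndexFlatL2(256)
index.add(big_npy)
feats = np.random.rand(50, 256).astype(np.float32)
print(retrieve_and_blend(feats, index, big_npy).shape)  # (50, 256)
```

Blending rather than outright replacement is what lets the index rate trade timbre similarity against intelligibility.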
infer/modules/vc/__init__.py CHANGED
@@ -1,5 +0,0 @@
- from .pipeline import Pipeline
- from .modules import VC
- from .utils import get_index_path_from_model, load_hubert
- from .info import show_info
- from .hash import model_hash_ckpt, hash_id, hash_similarity
infer/modules/vc/modules.py CHANGED
@@ -1,6 +1,5 @@
import traceback
import logging
- import os

logger = logging.getLogger(__name__)

@@ -10,10 +9,14 @@ import torch
from io import BytesIO

from infer.lib.audio import load_audio, wav2
- from rvc.synthesizer import get_synthesizer, load_synthesizer
- from .info import show_model_info
- from .pipeline import Pipeline
- from .utils import get_index_path_from_model, load_hubert
+ from infer.lib.infer_pack.models import (
+     SynthesizerTrnMs256NSFsid,
+     SynthesizerTrnMs256NSFsid_nono,
+     SynthesizerTrnMs768NSFsid,
+     SynthesizerTrnMs768NSFsid_nono,
+ )
+ from infer.modules.vc.pipeline import Pipeline
+ from infer.modules.vc.utils import *


class VC:
@@ -59,45 +62,71 @@ class VC:
                ) = None
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
-                 elif torch.backends.mps.is_available():
-                     torch.mps.empty_cache()
                ### without this extra juggling, the memory below cannot be cleaned up fully
-                 self.net_g, self.cpt = get_synthesizer(self.cpt, self.config.device)
                self.if_f0 = self.cpt.get("f0", 1)
                self.version = self.cpt.get("version", "v1")
+                 if self.version == "v1":
+                     if self.if_f0 == 1:
+                         self.net_g = SynthesizerTrnMs256NSFsid(
+                             *self.cpt["config"], is_half=self.config.is_half
+                         )
+                     else:
+                         self.net_g = SynthesizerTrnMs256NSFsid_nono(*self.cpt["config"])
+                 elif self.version == "v2":
+                     if self.if_f0 == 1:
+                         self.net_g = SynthesizerTrnMs768NSFsid(
+                             *self.cpt["config"], is_half=self.config.is_half
+                         )
+                     else:
+                         self.net_g = SynthesizerTrnMs768NSFsid_nono(*self.cpt["config"])
                del self.net_g, self.cpt
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
-                 elif torch.backends.mps.is_available():
-                     torch.mps.empty_cache()
            return (
-                 (
-                     {"visible": False, "__type__": "update"},
-                     to_return_protect0,
-                     to_return_protect1,
-                     {"value": to_return_protect[2], "__type__": "update"},
-                     {"value": to_return_protect[3], "__type__": "update"},
-                     {"value": "", "__type__": "update"},
-                 )
-                 if to_return_protect
-                 else {"visible": True, "maximum": 0, "__type__": "update"}
+                 {"visible": False, "__type__": "update"},
+                 {
+                     "visible": True,
+                     "value": to_return_protect0,
+                     "__type__": "update",
+                 },
+                 {
+                     "visible": True,
+                     "value": to_return_protect1,
+                     "__type__": "update",
+                 },
+                 "",
+                 "",
            )
-
        person = f'{os.getenv("weight_root")}/{sid}'
        logger.info(f"Loading: {person}")

+         self.cpt = torch.load(person, map_location="cpu")
        self.tgt_sr = self.cpt["config"][-1]
        self.cpt["config"][-3] = self.cpt["weight"]["emb_g.weight"].shape[0]  # n_spk
        self.if_f0 = self.cpt.get("f0", 1)
        self.version = self.cpt.get("version", "v1")

+         synthesizer_class = {
+             ("v1", 1): SynthesizerTrnMs256NSFsid,
+             ("v1", 0): SynthesizerTrnMs256NSFsid_nono,
+             ("v2", 1): SynthesizerTrnMs768NSFsid,
+             ("v2", 0): SynthesizerTrnMs768NSFsid_nono,
+         }
+
+         self.net_g = synthesizer_class.get(
+             (self.version, self.if_f0), SynthesizerTrnMs256NSFsid
+         )(*self.cpt["config"], is_half=self.config.is_half)
+
+         del self.net_g.enc_q
+
+         self.net_g.load_state_dict(self.cpt["weight"], strict=False)
+         self.net_g.eval().to(self.config.device)
        if self.config.is_half:
            self.net_g = self.net_g.half()
        else:
            self.net_g = self.net_g.float()
-         self.pipeline = Pipeline(self.tgt_sr, self.config)

+         self.pipeline = Pipeline(self.tgt_sr, self.config)
        n_spk = self.cpt["config"][-3]
        index = {"value": get_index_path_from_model(sid), "__type__": "update"}
        logger.info("Select index: " + index["value"])
@@ -109,7 +138,6 @@ class VC:
                to_return_protect1,
                index,
                index,
-                 show_model_info(self.cpt),
            )
            if to_return_protect
            else {"visible": True, "maximum": n_spk, "__type__": "update"}
@@ -132,22 +160,18 @@ class VC:
    ):
        if input_audio_path is None:
            return "You need to upload an audio", None
-         elif hasattr(input_audio_path, "name"):
-             input_audio_path = str(input_audio_path.name)
        f0_up_key = int(f0_up_key)
        try:
            audio = load_audio(input_audio_path, 16000)
            audio_max = np.abs(audio).max() / 0.95
            if audio_max > 1:
-                 np.divide(audio, audio_max, audio)
+                 audio /= audio_max
            times = [0, 0, 0]

            if self.hubert_model is None:
-                 self.hubert_model = load_hubert(self.config.device, self.config.is_half)
+                 self.hubert_model = load_hubert(self.config)

            if file_index:
-                 if hasattr(file_index, "name"):
-                     file_index = str(file_index.name)
                file_index = (
                    file_index.strip(" ")
                    .strip('"')
@@ -166,6 +190,7 @@ class VC:
                self.net_g,
                sid,
                audio,
+                 input_audio_path,
                times,
                f0_up_key,
                f0_method,
@@ -179,25 +204,25 @@ class VC:
                self.version,
                protect,
                f0_file,
-             ).astype(np.int16)
+             )
            if self.tgt_sr != resample_sr >= 16000:
                tgt_sr = resample_sr
            else:
                tgt_sr = self.tgt_sr
            index_info = (
-                 "Index: %s." % file_index
+                 "Index:\n%s." % file_index
                if os.path.exists(file_index)
                else "Index not used."
            )
            return (
-                 "Success.\n%s\nTime: npy: %.2fs, f0: %.2fs, infer: %.2fs."
+                 "Success.\n%s\nTime:\nnpy: %.2fs, f0: %.2fs, infer: %.2fs."
                % (index_info, *times),
                (tgt_sr, audio_opt),
            )
-         except Exception as e:
+         except:
            info = traceback.format_exc()
            logger.warning(info)
-             return str(e), None
+             return info, (None, None)

    def vc_multi(
        self,
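
One small but important detail in `vc_single` above: before inference the input is peak-normalized so its absolute maximum never exceeds 0.95 (`audio /= audio_max` whenever `np.abs(audio).max() / 0.95 > 1`). A standalone sketch of that step; the function name and the `headroom` parameter are illustrative, not part of the repository:

```python
import numpy as np

def normalize_peak(audio, headroom=0.95):
    """Scale the signal down only when its peak would exceed `headroom`."""
    audio_max = np.abs(audio).max() / headroom
    if audio_max > 1:
        audio = audio / audio_max  # new peak is exactly `headroom`
    return audio

x = np.array([0.2, -1.6, 0.5], dtype=np.float32)
print(normalize_peak(x))  # peak scaled down to 0.95; quiet signals untouched
```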
infer/modules/vc/pipeline.py CHANGED
@@ -5,22 +5,40 @@ import logging

logger = logging.getLogger(__name__)

- from time import time
+ from functools import lru_cache
+ from time import time as ttime

import faiss
import librosa
import numpy as np
+ import parselmouth
+ import pyworld
import torch
import torch.nn.functional as F
+ import torchcrepe
from scipy import signal

- from rvc.f0 import PM, Harvest, RMVPE, CRePE, Dio, FCPE
-
now_dir = os.getcwd()
sys.path.append(now_dir)

bh, ah = signal.butter(N=5, Wn=48, btype="high", fs=16000)

+ input_audio_path2wav = {}
+
+
+ @lru_cache
+ def cache_harvest_f0(input_audio_path, fs, f0max, f0min, frame_period):
+     audio = input_audio_path2wav[input_audio_path]
+     f0, t = pyworld.harvest(
+         audio,
+         fs=fs,
+         f0_ceil=f0max,
+         f0_floor=f0min,
+         frame_period=frame_period,
+     )
+     f0 = pyworld.stonemask(audio, f0, t, fs)
+     return f0
+
+
def change_rms(data1, sr1, data2, sr2, rate):  # 1 is the input audio, 2 the output audio; rate is the proportion of 2
    # print(data1.max(),data2.max())
@@ -65,6 +83,7 @@ class Pipeline(object):

    def get_f0(
        self,
+         input_audio_path,
        x,
        p_len,
        f0_up_key,
@@ -72,62 +91,73 @@ class Pipeline(object):
        filter_radius,
        inp_f0=None,
    ):
+         global input_audio_path2wav
+         time_step = self.window / self.sr * 1000
        f0_min = 50
        f0_max = 1100
        f0_mel_min = 1127 * np.log(1 + f0_min / 700)
        f0_mel_max = 1127 * np.log(1 + f0_max / 700)
        if f0_method == "pm":
-             if not hasattr(self, "pm"):
-                 self.pm = PM(self.window, f0_min, f0_max, self.sr)
-             f0 = self.pm.compute_f0(x, p_len=p_len)
-         if f0_method == "dio":
-             if not hasattr(self, "dio"):
-                 self.dio = Dio(self.window, f0_min, f0_max, self.sr)
-             f0 = self.dio.compute_f0(x, p_len=p_len)
+             f0 = (
+                 parselmouth.Sound(x, self.sr)
+                 .to_pitch_ac(
+                     time_step=time_step / 1000,
+                     voicing_threshold=0.6,
+                     pitch_floor=f0_min,
+                     pitch_ceiling=f0_max,
+                 )
+                 .selected_array["frequency"]
+             )
+             pad_size = (p_len - len(f0) + 1) // 2
+             if pad_size > 0 or p_len - len(f0) - pad_size > 0:
+                 f0 = np.pad(
+                     f0, [[pad_size, p_len - len(f0) - pad_size]], mode="constant"
+                 )
        elif f0_method == "harvest":
-             if not hasattr(self, "harvest"):
-                 self.harvest = Harvest(self.window, f0_min, f0_max, self.sr)
-             f0 = self.harvest.compute_f0(x, p_len=p_len, filter_radius=filter_radius)
+             input_audio_path2wav[input_audio_path] = x.astype(np.double)
+             f0 = cache_harvest_f0(input_audio_path, self.sr, f0_max, f0_min, 10)
+             if filter_radius > 2:
+                 f0 = signal.medfilt(f0, 3)
        elif f0_method == "crepe":
-             if not hasattr(self, "crepe"):
-                 self.crepe = CRePE(
-                     self.window,
-                     f0_min,
-                     f0_max,
-                     self.sr,
-                     self.device,
-                 )
-             f0 = self.crepe.compute_f0(x, p_len=p_len)
+             model = "full"
+             # Pick a batch size that doesn't cause memory errors on your gpu
+             batch_size = 512
+             # Compute pitch using first gpu
+             audio = torch.tensor(np.copy(x))[None].float()
+             f0, pd = torchcrepe.predict(
+                 audio,
+                 self.sr,
+                 self.window,
+                 f0_min,
+                 f0_max,
+                 model,
+                 batch_size=batch_size,
+                 device=self.device,
+                 return_periodicity=True,
+             )
+             pd = torchcrepe.filter.median(pd, 3)
+             f0 = torchcrepe.filter.mean(f0, 3)
+             f0[pd < 0.1] = 0
+             f0 = f0[0].cpu().numpy()
        elif f0_method == "rmvpe":
-             if not hasattr(self, "rmvpe"):
+             if not hasattr(self, "model_rmvpe"):
+                 from infer.lib.rmvpe import RMVPE
+
                logger.info(
-                     "Loading rmvpe model %s" % "%s/rmvpe.pt" % os.environ["rmvpe_root"]
+                     "Loading rmvpe model,%s" % "%s/rmvpe.pt" % os.environ["rmvpe_root"]
                )
-                 self.rmvpe = RMVPE(
+                 self.model_rmvpe = RMVPE(
                    "%s/rmvpe.pt" % os.environ["rmvpe_root"],
                    is_half=self.is_half,
                    device=self.device,
-                     # use_jit=self.config.use_jit,
                )
-             f0 = self.rmvpe.compute_f0(x, p_len=p_len, filter_radius=0.03)
+             f0 = self.model_rmvpe.infer_from_audio(x, thred=0.03)

            if "privateuseone" in str(self.device):  # clean ortruntime memory
-                 del self.rmvpe.model
-                 del self.rmvpe
+                 del self.model_rmvpe.model
+                 del self.model_rmvpe
                logger.info("Cleaning ortruntime memory")

-         elif f0_method == "fcpe":
-             if not hasattr(self, "model_fcpe"):
-                 logger.info("Loading fcpe model")
-                 self.model_fcpe = FCPE(
-                     self.window,
-                     f0_min,
-                     f0_max,
-                     self.sr,
-                     self.device,
-                 )
-             f0 = self.model_fcpe.compute_f0(x, p_len=p_len)
-
        f0 *= pow(2, f0_up_key / 12)
        # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
        tf0 = self.sr // self.window  # number of f0 points per second
@@ -184,7 +214,7 @@ class Pipeline(object):
            "padding_mask": padding_mask,
            "output_layer": 9 if version == "v1" else 12,
        }
-         t0 = time()
+         t0 = ttime()
        with torch.no_grad():
            logits = model.extract_features(**inputs)
            feats = model.final_proj(logits[0]) if version == "v1" else logits[0]
@@ -202,10 +232,7 @@ class Pipeline(object):
            # _, I = index.search(npy, 1)
            # npy = big_npy[I.squeeze()]

-             try:
-                 score, ix = index.search(npy, k=8)
-             except:
-                 raise Exception("index mistatch")
+             score, ix = index.search(npy, k=8)
            weight = np.square(1 / score)
            weight /= weight.sum(axis=1, keepdims=True)
            npy = np.sum(big_npy[ix] * np.expand_dims(weight, axis=2), axis=1)
@@ -222,7 +249,7 @@ class Pipeline(object):
            feats0 = F.interpolate(feats0.permute(0, 2, 1), scale_factor=2).permute(
                0, 2, 1
            )
-         t1 = time()
+         t1 = ttime()
        p_len = audio0.shape[0] // self.window
        if feats.shape[1] < p_len:
            p_len = feats.shape[1]
@@ -239,26 +266,14 @@ class Pipeline(object):
        feats = feats.to(feats0.dtype)
        p_len = torch.tensor([p_len], device=self.device).long()
        with torch.no_grad():
-             audio1 = (
-                 (
-                     net_g.infer(
-                         feats,
-                         p_len,
-                         sid,
-                         pitch=pitch,
-                         pitchf=pitchf,
-                     )[0, 0]
-                 )
-                 .data.cpu()
-                 .float()
-                 .numpy()
-             )
+             hasp = pitch is not None and pitchf is not None
+             arg = (feats, p_len, pitch, pitchf, sid) if hasp else (feats, p_len, sid)
+             audio1 = (net_g.infer(*arg)[0][0, 0]).data.cpu().float().numpy()
+             del hasp, arg
            del feats, p_len, padding_mask
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
-         elif torch.backends.mps.is_available():
-             torch.mps.empty_cache()
-         t2 = time()
+         t2 = ttime()
        times[0] += t1 - t0
        times[2] += t2 - t1
        return audio1
@@ -269,6 +284,7 @@ class Pipeline(object):
        net_g,
        sid,
        audio,
+         input_audio_path,
        times,
        f0_up_key,
        f0_method,
@@ -292,6 +308,7 @@ class Pipeline(object):
        ):
            try:
                index = faiss.read_index(file_index)
+                 # big_npy = np.load(file_big_npy)
                big_npy = index.reconstruct_n(0, index.ntotal)
            except:
                traceback.print_exc()
@@ -317,7 +334,7 @@ class Pipeline(object):
        s = 0
        audio_opt = []
        t = None
-         t1 = time()
+         t1 = ttime()
        audio_pad = np.pad(audio, (self.t_pad, self.t_pad), mode="reflect")
        p_len = audio_pad.shape[0] // self.window
        inp_f0 = None
@@ -333,29 +350,27 @@ class Pipeline(object):
                traceback.print_exc()
        sid = torch.tensor(sid, device=self.device).unsqueeze(0).long()
        pitch, pitchf = None, None
-         if if_f0:
-             if if_f0 == 1:
-                 pitch, pitchf = self.get_f0(
-                     audio_pad,
-                     p_len,
-                     f0_up_key,
-                     f0_method,
-                     filter_radius,
-                     inp_f0,
-                 )
-             elif if_f0 == 2:
-                 pitch, pitchf = f0_method
+         if if_f0 == 1:
+             pitch, pitchf = self.get_f0(
+                 input_audio_path,
+                 audio_pad,
+                 p_len,
+                 f0_up_key,
+                 f0_method,
+                 filter_radius,
+                 inp_f0,
+             )
            pitch = pitch[:p_len]
            pitchf = pitchf[:p_len]
            if "mps" not in str(self.device) or "xpu" not in str(self.device):
                pitchf = pitchf.astype(np.float32)
            pitch = torch.tensor(pitch, device=self.device).unsqueeze(0).long()
            pitchf = torch.tensor(pitchf, device=self.device).unsqueeze(0).float()
-         t2 = time()
+         t2 = ttime()
        times[1] += t2 - t1
        for t in opt_ts:
            t = t // self.window * self.window
-             if if_f0:
+             if if_f0 == 1:
                audio_opt.append(
                    self.vc(
                        model,
@@ -390,7 +405,7 @@ class Pipeline(object):
                )[self.t_pad_tgt : -self.t_pad_tgt]
            )
        s = t
-         if if_f0:
+         if if_f0 == 1:
            audio_opt.append(
                self.vc(
                    model,
@@ -435,10 +450,8 @@ class Pipeline(object):
        max_int16 = 32768
        if audio_max > 1:
            max_int16 /= audio_max
-         np.multiply(audio_opt, max_int16, audio_opt)
+         audio_opt = (audio_opt * max_int16).astype(np.int16)
        del pitch, pitchf, sid
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
-         elif torch.backends.mps.is_available():
-             torch.mps.empty_cache()
        return audio_opt
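
`get_f0` above bounds pitch to 50–1100 Hz, converts those bounds to the mel scale (`f0_mel = 1127 * ln(1 + f0/700)`), and transposes by `f0_up_key` semitones via `f0 *= 2**(f0_up_key / 12)`. The mel bounds feed a coarse 1–255 quantization used for the pitch embedding; that quantization sits just past this hunk, so the sketch below follows the usual RVC scheme but should be read as an assumption rather than a quote of this file:

```python
import numpy as np

f0_min, f0_max = 50.0, 1100.0
f0_mel_min = 1127 * np.log(1 + f0_min / 700)   # lower mel bound
f0_mel_max = 1127 * np.log(1 + f0_max / 700)   # upper mel bound

def shift_and_coarse(f0, f0_up_key):
    """Transpose by semitones, then map Hz onto 255 mel-spaced bins."""
    f0 = f0 * pow(2, f0_up_key / 12)           # semitone shift
    f0_mel = 1127 * np.log(1 + f0 / 700)       # Hz -> mel
    voiced = f0_mel > 0                        # 0 Hz marks unvoiced frames
    f0_mel[voiced] = (f0_mel[voiced] - f0_mel_min) * 254 / (
        f0_mel_max - f0_mel_min
    ) + 1
    return np.rint(np.clip(f0_mel, 1, 255)).astype(np.int64)

print(shift_and_coarse(np.array([0.0, 220.0, 440.0]), 12))  # unvoiced -> bin 1
```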
infer/modules/vc/utils.py CHANGED
@@ -9,8 +9,7 @@ def get_index_path_from_model(sid):
            f
            for f in [
                os.path.join(root, name)
-                 for path in [os.getenv("outside_index_root"), os.getenv("index_root")]
-                 for root, _, files in os.walk(path, topdown=False)
+                 for root, _, files in os.walk(os.getenv("index_root"), topdown=False)
                for name in files
                if name.endswith(".index") and "trained" not in name
            ]
@@ -20,14 +19,14 @@ def get_index_path_from_model(sid):
    )


- def load_hubert(device, is_half):
+ def load_hubert(config):
    models, _, _ = checkpoint_utils.load_model_ensemble_and_task(
        ["assets/hubert/hubert_base.pt"],
        suffix="",
    )
    hubert_model = models[0]
-     hubert_model = hubert_model.to(device)
-     if is_half:
+     hubert_model = hubert_model.to(config.device)
+     if config.is_half:
        hubert_model = hubert_model.half()
    else:
        hubert_model = hubert_model.float()
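
After this change, `get_index_path_from_model` walks only `index_root` (the `outside_index_root` fallback is gone) and returns the first `.index` file that is not a "trained" index and matches the model. An illustrative reimplementation of that lookup; the function name and the matching rule (model stem contained in the file path) are assumptions for the sketch:

```python
import os

def find_index_for_model(sid, index_root):
    """Return the first non-'trained' .index file under index_root whose
    path mentions the model's base name; '' if nothing matches."""
    stem = sid.split(".")[0]  # e.g. "singer.pth" -> "singer"
    for root, _, files in os.walk(index_root, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(".index") and "trained" not in name and stem in path:
                return path
    return ""

print(find_index_for_model("singer.pth", "./logs"))  # '' when nothing matches
```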
requirements.txt CHANGED
@@ -1,15 +1,16 @@
joblib>=1.1.0
- numba
+ numba==0.56.4
numpy==1.23.5
scipy
librosa==0.9.1
- llvmlite
- fairseq
- faiss-cpu
- gradio
+ llvmlite==0.39.0
+ fairseq==0.12.2
+ faiss-cpu==1.7.3
+ gradio==3.34.0
Cython
pydub>=0.25.1
soundfile>=0.12.1
+ ffmpeg-python>=0.2.0
tensorboardX
Jinja2>=3.1.2
json5
@@ -40,8 +41,8 @@ httpx
onnxruntime; sys_platform == 'darwin'
onnxruntime-gpu; sys_platform != 'darwin'
torchcrepe==0.0.20
- fastapi
+ fastapi==0.88
torchfcpe
+ ffmpy==0.3.1
python-dotenv>=1.0.0
av
- pybase16384