diff --git a/Dockerfile b/Dockerfile
index 651e74bdad7eedadb87d2c5c2d2c586e8fdf074f..f9872b4420291ee5c82695e7e57acf4506fc1b76 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -6,10 +6,14 @@ RUN apt update
RUN apt install -y git libsndfile1-dev python3 python3-dev python3-pip ffmpeg
RUN python3 -m pip install --no-cache-dir --upgrade pip
-RUN git clone https://github.com/svc-develop-team/so-vits-svc.git && cd so-vits-svc
+COPY ./so-vits-svc /work/so-vits-svc
+WORKDIR /work/so-vits-svc/pretrain/nsf_hifigan
+RUN wget -c https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
+RUN unzip -q nsf_hifigan_20221211.zip
+WORKDIR /work/so-vits-svc
RUN pip install --no-cache-dir --upgrade -r /work/so-vits-svc/requirements.txt
ENV SERVER_NAME="0.0.0.0"
ENV SERVER_PORT=7860
-CMD ["python", "webUI.py"]
+CMD ["python", "webUI.py"]
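The `SERVER_NAME` and `SERVER_PORT` values set in the Dockerfile only take effect if the entry point reads them. `webUI.py` is not part of this diff, so the following is a minimal sketch of one way a script could pick them up; the helper name `resolve_bind_address` and its fallback defaults are assumptions for illustration, not code from this repository.

```python
import os


def resolve_bind_address(env=None):
    """Return (host, port) taken from SERVER_NAME / SERVER_PORT.

    Hypothetical helper: the fallback values are illustrative defaults only.
    """
    env = os.environ if env is None else env
    host = env.get("SERVER_NAME", "127.0.0.1")
    port = int(env.get("SERVER_PORT", "7860"))
    if not 0 < port < 65536:
        raise ValueError(f"SERVER_PORT out of range: {port}")
    return host, port


if __name__ == "__main__":
    # Pass the Dockerfile's values explicitly instead of relying on the real environment.
    print(*resolve_bind_address({"SERVER_NAME": "0.0.0.0", "SERVER_PORT": "7860"}))
```

Binding to `0.0.0.0` rather than the loopback address is what makes the server reachable from outside the container.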
diff --git a/so-vits-svc b/so-vits-svc
deleted file mode 160000
index 5977fb41d9930440c4a5a18b4badf4a7444af5c8..0000000000000000000000000000000000000000
--- a/so-vits-svc
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 5977fb41d9930440c4a5a18b4badf4a7444af5c8
diff --git a/so-vits-svc/LICENSE b/so-vits-svc/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..28bac26685278dcadb12f316bf4664395345c4f8
--- /dev/null
+++ b/so-vits-svc/LICENSE
@@ -0,0 +1,28 @@
+BSD 3-Clause License
+
+Copyright (c) 2023, SVC Develop Team
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this
+ list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its
+ contributors may be used to endorse or promote products derived from
+ this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/so-vits-svc/README.md b/so-vits-svc/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5805769fc7f98272fc9ac87dcaec6e24fd9c5ec2
--- /dev/null
+++ b/so-vits-svc/README.md
@@ -0,0 +1,292 @@
+# SoftVC VITS Singing Voice Conversion
+
+SoVitsSvc is only one of many projects in the field of Singing Voice Conversion; the others are not listed here. This project has been officially discontinued and archived.
+However, other enthusiasts have created their own forks and continue to maintain SoVitsSvc. These forks are unrelated to SvcDevelopTeam and the maintainers of this repository, and some of them contain major changes that you can explore for yourself.
+
+#### ✨ A fork with a greatly improved interface: [34j/so-vits-svc-fork](https://github.com/34j/so-vits-svc-fork)
+
+#### ✨ A client supports real-time conversion: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
+
+#### This project is fundamentally different from Vits. Vits is TTS and this project is SVC. This project cannot perform TTS, Vits cannot perform SVC, and models from the two projects are not interchangeable.
+
+## Disclaimer
+
+This project is an open source, offline project, and all members of SvcDevelopTeam and all developers and maintainers of this project (hereinafter referred to as contributors) have no control over this project. The contributors have never provided any organization or individual with any form of assistance, including but not limited to dataset extraction, dataset processing, computing support, training support, or inference. Contributors do not and cannot know what users are using the project for. Therefore, all AI models and synthesized audio produced by training with this project have nothing to do with its contributors. All problems arising therefrom shall be borne by the user.
+
+This project runs completely offline and cannot collect any user information or obtain user input data. Therefore, contributors to this project are not aware of any user input or models and are not responsible for any user input.
+
+This project is only a framework. It has no speech synthesis capability of its own, and all functions require users to train models themselves. No model is shipped with this project, and any redistributed project has nothing to do with the contributors of this project.
+
+## 📏 Terms of Use
+
+# Warning: Please resolve the authorization of your dataset on your own. You shall be solely responsible for any problems caused by training on unauthorized datasets and for all consequences thereof. The repository and its maintainer, svc develop team, have nothing to do with the consequences!
+
+1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments.
+2. Any videos based on sovits that are published on video platforms must clearly indicate in the description that they are used for voice changing and specify the input source of the voice or audio, for example, using videos or audios published by others and separating the vocals as input source for conversion, which must provide clear original video or music links. If your own voice or other synthesized voices from other commercial vocal synthesis software are used as the input source for conversion, you must also explain it in the description.
+3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.
+4. It is forbidden to use the project for illegal, religious, or political activities. The project developers firmly oppose such activities. If you do not agree with this clause, you may not use the project.
+5. Continuing to use this project is deemed as agreeing to the relevant provisions stated in this repository README. This README has fulfilled its obligation to advise and is not responsible for any subsequent problems that may arise.
+6. If you use this project for any other purpose, please contact and inform the author of this repository in advance. Thank you very much.
+
+## 🆕 Update!
+
+> Updated the 4.0-v2 model, the entire process is the same as 4.0. Compared to 4.0, there is some improvement in certain scenarios, but there are also some cases where it has regressed. Please refer to the [4.0-v2 branch](https://github.com/svc-develop-team/so-vits-svc/tree/4.0-v2) for more information.
+
+## 📝 Feature list of the 4.0 branches
+
+| Branch | Feature | Compatible with main-branch models? |
+| :-------------: | :----------: | :------------: |
+| 4.0 | main branch | - |
+| 4.0-v2 | The VISinger2 model is used | No |
+| 4.0-Vec768-Layer12 | The feature input is the Layer 12 Transformer output of the Content Vec | No |
+
+## 📝 Model Introduction
+
+The singing voice conversion model uses SoftVC content encoder to extract source audio speech features, then the vectors are directly fed into VITS instead of converting to a text based intermediate; thus the pitch and intonations are conserved. Additionally, the vocoder is changed to [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption.
+
+### 🆕 4.0 Version Update Content
+
+- Feature input is changed to [Content Vec](https://github.com/auspicious3000/contentvec)
+- The sampling rate is unified to use 44100Hz
+- Due to the change of hop size and other parameters, as well as the streamlining of some model structures, the required GPU memory for inference is **significantly reduced**. The 44kHz GPU memory usage of version 4.0 is even smaller than the 32kHz usage of version 3.0.
+- Some code structures have been adjusted
+- The dataset creation and training process are consistent with version 3.0, but the models are not compatible with 3.0, and the dataset needs to be fully pre-processed again.
+- Added option 1: automatic pitch prediction for vc mode, which means that you don't need to manually enter the pitch key when converting speech, and the pitch of male and female voices can be converted automatically. However, this mode will cause pitch shift when converting songs.
+- Added option 2: reduce timbre leakage through a k-means clustering scheme, making the timbre more similar to the target timbre.
+- Added option 3: [NSF-HIFIGAN Enhancer](https://github.com/yxlllc/DDSP-SVC), which improves sound quality for some models trained on small datasets but degrades well-trained models, so it is disabled by default.
+
+## 💬 About Python Version
+
+After conducting tests, we believe that the project runs stably on `Python 3.8.9`.
+
+## 📥 Pre-trained Model Files
+
+#### **Required**
+
+- ContentVec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
+  - Place it under the `hubert` directory
+
+```shell
+# contentvec
+wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
+# Alternatively, you can manually download and place it in the hubert directory
+```
+
+#### **Optional (strongly recommended)**
+
+- Pre-trained model files: `G_0.pth` `D_0.pth`
+  - Place them under the `logs/44k` directory
+
+Get them from svc-develop-team (TBD) or anywhere else.
+
+Although pretrained models generally do not cause copyright problems, please check anyway: for example, ask the author in advance, or confirm that the author has clearly stated the permitted uses in the description.
+
+#### **Optional (select as required)**
+
+If you use the NSF-HIFIGAN enhancer, download the pre-trained NSF-HIFIGAN model; otherwise you can skip this step.
+
+- Pre-trained NSF-HIFIGAN Vocoder: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip)
+  - Unzip and place the four files under the `pretrain/nsf_hifigan` directory
+
+```shell
+# nsf_hifigan
+wget -P pretrain/nsf_hifigan/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
+# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
+# URL:https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
+```
+
+## 📊 Dataset Preparation
+
+Simply place the dataset in the `dataset_raw` directory with the following file structure.
+
+```
+dataset_raw
+├───speaker0
+│   ├───xxx1-xxx1.wav
+│   ├───...
+│   └───Lxx-0xx8.wav
+└───speaker1
+    ├───xx2-0xxx2.wav
+    ├───...
+    └───xxx7-xxx007.wav
+```
+
+You can customize the speaker name.
+
+```
+dataset_raw
+└───suijiSUI
+    ├───1.wav
+    ├───...
+    └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
+```
+
+## 🛠️ Preprocessing
+
+### 0. Slice audio
+
+Slice the audio into clips of `5s - 15s`; slightly longer clips are fine. Clips that are too long may lead to `torch.cuda.OutOfMemoryError` during training or even pre-processing.
+
+You can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer) for this.
+
+In general, only the `Minimum Interval` needs to be adjusted. For speech audio the default is usually fine. For singing audio it can be lowered to `100` or even `50`.
+
+After slicing, delete any clips that are too long or too short.
+
+### 1. Resample to 44100Hz and mono
+
+```shell
+python resample.py
+```
+
+### 2. Automatically split the dataset into training and validation sets, and generate configuration files.
+
+```shell
+python preprocess_flist_config.py
+```
+
+### 3. Generate hubert and f0
+
+```shell
+python preprocess_hubert_f0.py
+```
+
+After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.
+
+#### You can modify some parameters in the generated config.json
+
+* `keep_ckpts`: Keep the last `keep_ckpts` models during training. Setting it to `0` keeps them all. Default is `3`.
+
+* `all_in_mem`: Load the entire dataset into RAM. Enable it when the disk IO of your platform is too slow and the system memory is **much larger** than your dataset.
+
+## 🏋️♀️ Training
+
+```shell
+python train.py -c configs/config.json -m 44k
+```
+
+## 🤖 Inference
+
+Use [inference_main.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/inference_main.py)
+
+```shell
+# Example
+python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "君の知らない物語-src.wav" -t 0
+```
+
+Required parameters:
+- `-m` | `--model_path`: Path to the model.
+- `-c` | `--config_path`: Path to the configuration file.
+- `-s` | `--spk_list`: Target speaker name for conversion.
+- `-n` | `--clean_names`: A list of wav file names located in the raw folder.
+- `-t` | `--trans`: Pitch adjustment, supports positive and negative (semitone) values.
+
+Optional parameters (see the next section for details):
+- `-a` | `--auto_predict_f0`: Automatic pitch prediction for voice conversion. Do not enable this when converting songs as it can cause serious pitch issues.
+- `-cl` | `--clip`: Forced slicing of the input, with the slice duration in seconds. Set to 0 to turn it off (default).
+- `-lg` | `--linear_gradient`: The cross-fade length between two audio slices, in seconds. Adjust this value if the voice sounds discontinuous after forced slicing; otherwise keep the default of 0.
+- `-cm` | `--cluster_model_path`: Path to the clustering model. Fill in any value if no clustering model has been trained.
+- `-cr` | `--cluster_infer_ratio`: Proportion of the clustering solution, range 0-1. Fill in 0 if no clustering model has been trained.
+- `-fmp` | `--f0_mean_pooling`: Apply a mean filter (pooling) to f0, which may improve some hoarse sounds. Enabling this option will reduce inference speed.
+- `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer. It improves sound quality for some models trained on small datasets but degrades well-trained models, so it is turned off by default.
+
+## 🤔 Optional Settings
+
+If the results from the previous section are satisfactory, or if you didn't understand what is being discussed in the following section, you can skip it, and it won't affect the model usage. (These optional settings have a relatively small impact, and they may have some effect on certain specific data, but in most cases, the difference may not be noticeable.)
+
+### Automatic f0 prediction
+
+During 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. If the result is not good, set the pitch manually instead. Do not enable this feature when converting singing voice, as it may cause serious pitch shifting!
+- Set `auto_predict_f0` to true in `inference_main.py`.
+
+### Cluster-based timbre leakage control
+
+Introduction: The clustering scheme can reduce timbre leakage and make the trained model sound more like the target's timbre (although this effect is not very obvious), but using clustering alone will lower the model's clarity (the model may sound unclear). Therefore, this model adopts a fusion method to linearly control the proportion of clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target's timbre" and "being clear and articulate" to find a suitable trade-off point.
+
+The existing steps before clustering do not need to be changed. All you need to do is to train an additional clustering model, which has a relatively low training cost.
+
+- Training process:
+  - Train on a machine with good CPU performance. In my experience, it takes about 4 minutes to train each speaker on a Tencent Cloud machine with a 6-core CPU.
+  - Execute `python cluster/train_cluster.py`. The output model will be saved in `logs/44k/kmeans_10000.pt`.
+- Inference process:
+  - Specify `cluster_model_path` in `inference_main.py`.
+  - Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all, `1` means only using clustering, and usually `0.5` is sufficient.
+
+### F0 mean filtering
+
+Introduction: Mean filtering of F0 can effectively reduce the hoarseness caused by fluctuations in the predicted pitch (hoarseness caused by reverb or harmony cannot be eliminated for now). This feature greatly improves some songs, but it can make others go out of tune. If a song sounds hoarse after inference, consider enabling it.
+
+- Set `f0_mean_pooling` to true in `inference_main.py`
+
+### [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.0/sovits4_for_colab.ipynb) [sovits4_for_colab.ipynb](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.0/sovits4_for_colab.ipynb)
+
+**[23/03/16] No longer need to download hubert manually**
+
+**[23/04/14] Support NSF_HIFIGAN enhancer**
+
+## 📤 Exporting to Onnx
+
+Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)
+
+- Create a folder named `checkpoints` and open it
+- Create a folder in the `checkpoints` folder as your project folder, naming it after your project, for example `aziplayer`
+- Rename your model as `model.pth`, the configuration file as `config.json`, and place them in the `aziplayer` folder you just created
+- Modify `"NyaruTaffy"` in `path = "NyaruTaffy"` in [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py) to your project name, `path = "aziplayer"`
+- Run [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)
+- Wait for it to finish running. A `model.onnx` will be generated in your project folder, which is the exported model.
+
+### UI support for Onnx models
+
+- [MoeSS](https://github.com/NaruseMioShirakana/MoeSS)
+  - [Hubert4.0](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel)
+
+Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently, they cannot be exported on their own: Hubert in fairseq uses many unsupported operators and constant-related constructs that cause export errors or wrong input/output shapes and results.
+
+CppDataProcess contains some of the data-preprocessing functions used in MoeSS.
+
+## ☀️ Previous contributors
+
+For some reason the author deleted the original repository. Through the negligence of the organization members, the contributor list was lost when all files were re-uploaded directly to this repository at the start of its reconstruction. A list of previous contributors has now been added to README.md.
+
+*Some members are not listed, in accordance with their personal wishes.*
+
+
+
+## 📚 Some legal provisions for reference
+
+#### Any country, region, organization, or individual using this project must comply with the following laws.
+
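+#### Civil Code of the People's Republic of China
+
+##### Article 1019
+
+No organization or individual may infringe upon another person's right of likeness by defaming or defacing it, or by forging it through information technology. Without the consent of the holder of the right of likeness, no one may make, use, or publicize that person's likeness, unless otherwise provided by law. Without the consent of the holder of the right of likeness, the rights holder of a portrait work may not use or publicize the likeness by publishing, reproducing, distributing, renting, exhibiting, or similar means. The protection of a natural person's voice is governed, by reference, by the relevant provisions on the protection of the right of likeness.
+
+##### Article 1024
+
+[Right of reputation] Civil subjects enjoy the right of reputation. No organization or individual may infringe upon another person's right of reputation by insult, defamation, or similar means.
+
+##### Article 1027
+
+[Works infringing the right of reputation] Where a literary or artistic work published by an actor depicts real people and real events or a specific person, and contains insulting or defamatory content that infringes upon another person's right of reputation, the victim has the right to request, in accordance with the law, that the actor bear civil liability. Where a literary or artistic work published by an actor does not depict a specific person, and only its plot resembles that person's circumstances, the actor bears no civil liability.
+
+#### [Constitution of the People's Republic of China](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)
+
+#### [Criminal Law of the People's Republic of China](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%88%91%E6%B3%95)
+
+#### [Civil Code of the People's Republic of China](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)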
+
+## 💪 Thanks to all contributors for their efforts
+
+
+
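The README above specifies `-t` / `--trans` in semitones. Under twelve-tone equal temperament, shifting by n semitones multiplies a frequency by 2^(n/12). The sketch below illustrates only that relation and is not taken from `inference_main.py`; the name `shift_f0` and the convention that `0.0` marks an unvoiced frame are assumptions.

```python
def shift_f0(f0_hz, semitones):
    """Scale each voiced f0 value by 2 ** (semitones / 12)."""
    ratio = 2.0 ** (semitones / 12.0)
    # Assumption: 0.0 marks an unvoiced frame and is left untouched.
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


if __name__ == "__main__":
    contour = [220.0, 0.0, 440.0]
    print(shift_f0(contour, 12))   # one octave up
    print(shift_f0(contour, -12))  # one octave down
```

A value of `-t 0`, as in the README example, leaves the contour unchanged because the ratio is exactly 1.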
diff --git a/so-vits-svc/cluster/__init__.py b/so-vits-svc/cluster/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1b9bde04e73e9218a5d534227caa4c25332f424
--- /dev/null
+++ b/so-vits-svc/cluster/__init__.py
@@ -0,0 +1,29 @@
+import numpy as np
+import torch
+from sklearn.cluster import KMeans
+
+def get_cluster_model(ckpt_path):
+    checkpoint = torch.load(ckpt_path)
+    kmeans_dict = {}
+    for spk, ckpt in checkpoint.items():
+        km = KMeans(ckpt["n_features_in_"])
+        km.__dict__["n_features_in_"] = ckpt["n_features_in_"]
+        km.__dict__["_n_threads"] = ckpt["_n_threads"]
+        km.__dict__["cluster_centers_"] = ckpt["cluster_centers_"]
+        kmeans_dict[spk] = km
+    return kmeans_dict
+
+def get_cluster_result(model, x, speaker):
+    """
+        x: np.array [t, 256]
+        return cluster class result
+    """
+    return model[speaker].predict(x)
+
+def get_cluster_center_result(model, x,speaker):
+    """x: np.array [t, 256]"""
+    predict = model[speaker].predict(x)
+    return model[speaker].cluster_centers_[predict]
+
+def get_center(model, x,speaker):
+    return model[speaker].cluster_centers_[x]
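The README describes `cluster_infer_ratio` as a linear control between the clustered and non-clustered schemes, and `get_cluster_center_result` above returns the nearest cluster center for each frame. The inference code that combines the two is not part of this diff, so the following is a minimal NumPy sketch under the assumption that the fusion is `ratio * center + (1 - ratio) * feature`. The names `nearest_centers` and `blend_with_clusters` are hypothetical, and the nearest-center lookup stands in for `KMeans.predict`.

```python
import numpy as np


def nearest_centers(centers, x):
    """For each row of x [t, d], return the closest row of centers [k, d]."""
    # Squared Euclidean distance between every frame and every center: [t, k]
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return centers[dists.argmin(axis=1)]


def blend_with_clusters(centers, x, ratio):
    """Assumed linear mix: ratio=0 keeps x, ratio=1 uses only cluster centers."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return ratio * nearest_centers(centers, x) + (1.0 - ratio) * x


if __name__ == "__main__":
    centers = np.array([[0.0, 0.0], [10.0, 10.0]])
    frames = np.array([[1.0, 1.0], [9.0, 7.0]])
    print(blend_with_clusters(centers, frames, 0.5))
```

With `ratio=0.5` each frame moves halfway toward its nearest center, which matches the README's description of a trade-off between target timbre and clarity.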
diff --git a/so-vits-svc/cluster/train_cluster.py b/so-vits-svc/cluster/train_cluster.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ac025d400414226e66849407f477ae786c3d5d3
--- /dev/null
+++ b/so-vits-svc/cluster/train_cluster.py
@@ -0,0 +1,89 @@
+import os
+from glob import glob
+from pathlib import Path
+import torch
+import logging
+import argparse
+import numpy as np
+from sklearn.cluster import KMeans, MiniBatchKMeans
+import tqdm
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+import time
+import random
+
+def train_cluster(in_dir, n_clusters, use_minibatch=True, verbose=False):
+
+    logger.info(f"Loading features from {in_dir}")
+    features = []
+    nums = 0
+    for path in tqdm.tqdm(in_dir.glob("*.soft.pt")):
+        features.append(torch.load(path).squeeze(0).numpy().T)
+        nums += 1  # count the loaded feature files for the summary line below
+        # print(features[-1].shape)
+    features = np.concatenate(features, axis=0)
+    print(nums, features.nbytes/ 1024**2, "MB , shape:",features.shape, features.dtype)
+    features = features.astype(np.float32)
+    logger.info(f"Clustering features of shape: {features.shape}")
+    t = time.time()
+    if use_minibatch:
+        kmeans = MiniBatchKMeans(n_clusters=n_clusters,verbose=verbose, batch_size=4096, max_iter=80).fit(features)
+    else:
+        kmeans = KMeans(n_clusters=n_clusters,verbose=verbose).fit(features)
+    print(time.time()-t, "s")
+
+    # Keep only the attributes that cluster/__init__.py restores at load time.
+    x = {
+        "n_features_in_": kmeans.n_features_in_,
+        "_n_threads": kmeans._n_threads,
+        "cluster_centers_": kmeans.cluster_centers_,
+    }
+    print("end")
+
+    return x
+
+
+if __name__ == "__main__":
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--dataset', type=Path, default="./dataset/44k",
+                        help='path of training data directory')
+    parser.add_argument('--output', type=Path, default="logs/44k",
+                        help='path of model output directory')
+
+    args = parser.parse_args()
+
+    checkpoint_dir = args.output
+    dataset = args.dataset
+    n_clusters = 10000
+
+    # Train one k-means model per speaker directory and store them in one checkpoint.
+    ckpt = {}
+    for spk in os.listdir(dataset):
+        if os.path.isdir(dataset/spk):
+            print(f"train kmeans for {spk}...")
+            in_dir = dataset/spk
+            x = train_cluster(in_dir, n_clusters, verbose=False)
+            ckpt[spk] = x
+
+    checkpoint_path = checkpoint_dir / f"kmeans_{n_clusters}.pt"
+    checkpoint_path.parent.mkdir(exist_ok=True, parents=True)
+    torch.save(
+        ckpt,
+        checkpoint_path,
+    )
+
+
+ # import cluster
+ # for spk in tqdm.tqdm(os.listdir("dataset")):
+ # if os.path.isdir(f"dataset/{spk}"):
+ # print(f"start kmeans inference for {spk}...")
+ # for feature_path in tqdm.tqdm(glob(f"dataset/{spk}/*.discrete.npy", recursive=True)):
+ # mel_path = feature_path.replace(".discrete.npy",".mel.npy")
+ # mel_spectrogram = np.load(mel_path)
+ # feature_len = mel_spectrogram.shape[-1]
+ # c = np.load(feature_path)
+ # c = utils.tools.repeat_expand_2d(torch.FloatTensor(c), feature_len).numpy()
+ # feature = c.T
+ # feature_class = cluster.get_cluster_result(feature, spk)
+ # np.save(feature_path.replace(".discrete.npy", ".discrete_class.npy"), feature_class)
+
+
diff --git a/so-vits-svc/configs/config.json b/so-vits-svc/configs/config.json
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/configs_template/config_template.json b/so-vits-svc/configs_template/config_template.json
new file mode 100644
index 0000000000000000000000000000000000000000..a6555caef49bcb5159ec615adaff41120c93594d
--- /dev/null
+++ b/so-vits-svc/configs_template/config_template.json
@@ -0,0 +1,66 @@
+{
+ "train": {
+ "log_interval": 200,
+ "eval_interval": 800,
+ "seed": 1234,
+ "epochs": 10000,
+ "learning_rate": 0.0001,
+ "betas": [
+ 0.8,
+ 0.99
+ ],
+ "eps": 1e-09,
+ "batch_size": 6,
+ "fp16_run": false,
+ "lr_decay": 0.999875,
+ "segment_size": 10240,
+ "init_lr_ratio": 1,
+ "warmup_epochs": 0,
+ "c_mel": 45,
+ "c_kl": 1.0,
+ "use_sr": true,
+ "max_speclen": 512,
+ "port": "8001",
+ "keep_ckpts": 3,
+ "all_in_mem": false
+ },
+ "data": {
+ "training_files": "filelists/train.txt",
+ "validation_files": "filelists/val.txt",
+ "max_wav_value": 32768.0,
+ "sampling_rate": 44100,
+ "filter_length": 2048,
+ "hop_length": 512,
+ "win_length": 2048,
+ "n_mel_channels": 80,
+ "mel_fmin": 0.0,
+ "mel_fmax": 22050
+ },
+ "model": {
+ "inter_channels": 192,
+ "hidden_channels": 192,
+ "filter_channels": 768,
+ "n_heads": 2,
+ "n_layers": 6,
+ "kernel_size": 3,
+ "p_dropout": 0.1,
+ "resblock": "1",
+ "resblock_kernel_sizes": [3,7,11],
+ "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
+ "upsample_rates": [ 8, 8, 2, 2, 2],
+ "upsample_initial_channel": 512,
+ "upsample_kernel_sizes": [16,16, 4, 4, 4],
+ "n_layers_q": 3,
+ "use_spectral_norm": false,
+ "gin_channels": 256,
+ "ssl_dim": 256,
+ "n_speakers": 200
+ },
+ "spk": {
+ "nyaru": 0,
+ "huiyu": 1,
+ "nen": 2,
+ "paimon": 3,
+ "yunhao": 4
+ }
+}
\ No newline at end of file
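Several values in `config_template.json` are related to each other. The sketch below derives a few of those relations from the numbers in the template. It assumes the usual HiFi-GAN-style convention that the product of `upsample_rates` equals `hop_length`, so that one feature frame expands to `hop_length` audio samples; whether the model code enforces this is not visible in this diff. The function name `derived_quantities` is hypothetical.

```python
import math

# Values copied from config_template.json.
sampling_rate = 44100
hop_length = 512
segment_size = 10240
mel_fmax = 22050
upsample_rates = [8, 8, 2, 2, 2]


def derived_quantities():
    """Compute quantities implied by the template values."""
    return {
        # Total upsampling factor of the generator (assumed to match hop_length).
        "total_upsampling": math.prod(upsample_rates),
        # Number of feature frames per second of audio.
        "frames_per_second": sampling_rate / hop_length,
        # Length of one training segment in frames and in seconds.
        "segment_frames": segment_size // hop_length,
        "segment_seconds": segment_size / sampling_rate,
        # Highest representable frequency at this sampling rate.
        "nyquist_hz": sampling_rate / 2,
    }


if __name__ == "__main__":
    for name, value in derived_quantities().items():
        print(f"{name}: {value}")
```

If `hop_length` or `sampling_rate` is changed, the same arithmetic shows which other fields (`upsample_rates`, `segment_size`, `mel_fmax`) would need to be revisited.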
diff --git a/so-vits-svc/data_utils.py b/so-vits-svc/data_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c76fd1c3a45b8304d916161718c7763874f3e35
--- /dev/null
+++ b/so-vits-svc/data_utils.py
@@ -0,0 +1,155 @@
+import time
+import os
+import random
+import numpy as np
+import torch
+import torch.utils.data
+
+import modules.commons as commons
+import utils
+from modules.mel_processing import spectrogram_torch, spec_to_mel_torch
+from utils import load_wav_to_torch, load_filepaths_and_text
+
+# import h5py
+
+
+"""Multi speaker version"""
+
+
+class TextAudioSpeakerLoader(torch.utils.data.Dataset):
+    """
+        1) loads audio, content features, f0 and speaker id for each file
+        2) computes spectrograms from audio files (cached as .spec.pt)
+        3) randomly slices long items to a bounded number of frames
+    """
+
+    def __init__(self, audiopaths, hparams, all_in_mem: bool = False):
+        self.audiopaths = load_filepaths_and_text(audiopaths)
+        self.max_wav_value = hparams.data.max_wav_value
+        self.sampling_rate = hparams.data.sampling_rate
+        self.filter_length = hparams.data.filter_length
+        self.hop_length = hparams.data.hop_length
+        self.win_length = hparams.data.win_length
+        self.use_sr = hparams.train.use_sr
+        self.spec_len = hparams.train.max_speclen
+        self.spk_map = hparams.spk
+
+        random.seed(1234)
+        random.shuffle(self.audiopaths)
+
+        self.all_in_mem = all_in_mem
+        if self.all_in_mem:
+            self.cache = [self.get_audio(p[0]) for p in self.audiopaths]
+
+    def get_audio(self, filename):
+        filename = filename.replace("\\", "/")
+        audio, sampling_rate = load_wav_to_torch(filename)
+        if sampling_rate != self.sampling_rate:
+            raise ValueError("{} SR doesn't match target {} SR".format(
+                sampling_rate, self.sampling_rate))
+        audio_norm = audio / self.max_wav_value
+        audio_norm = audio_norm.unsqueeze(0)
+        spec_filename = filename.replace(".wav", ".spec.pt")
+
+        # Ideally, all data generated after Mar 25 should have .spec.pt
+        if os.path.exists(spec_filename):
+            spec = torch.load(spec_filename)
+        else:
+            spec = spectrogram_torch(audio_norm, self.filter_length,
+                                     self.sampling_rate, self.hop_length, self.win_length,
+                                     center=False)
+            spec = torch.squeeze(spec, 0)
+            torch.save(spec, spec_filename)
+
+        # The speaker name is the parent directory of the wav file.
+        spk = filename.split("/")[-2]
+        spk = torch.LongTensor([self.spk_map[spk]])
+
+        f0 = np.load(filename + ".f0.npy")
+        f0, uv = utils.interpolate_f0(f0)
+        f0 = torch.FloatTensor(f0)
+        uv = torch.FloatTensor(uv)
+
+        c = torch.load(filename+ ".soft.pt")
+        c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[0])
+
+
+        # Trim all streams to the shortest common number of frames.
+        lmin = min(c.size(-1), spec.size(-1))
+        assert abs(c.size(-1) - spec.size(-1)) < 3, (c.size(-1), spec.size(-1), f0.shape, filename)
+        assert abs(audio_norm.shape[1]-lmin * self.hop_length) < 3 * self.hop_length
+        spec, c, f0, uv = spec[:, :lmin], c[:, :lmin], f0[:lmin], uv[:lmin]
+        audio_norm = audio_norm[:, :lmin * self.hop_length]
+
+        return c, f0, spec, audio_norm, spk, uv
+
+    def random_slice(self, c, f0, spec, audio_norm, spk, uv):
+        # if spec.shape[1] < 30:
+        #     print("skip too short audio:", filename)
+        #     return None
+        if spec.shape[1] > 800:
+            start = random.randint(0, spec.shape[1]-800)
+            end = start + 790
+            spec, c, f0, uv = spec[:, start:end], c[:, start:end], f0[start:end], uv[start:end]
+            audio_norm = audio_norm[:, start * self.hop_length : end * self.hop_length]
+
+        return c, f0, spec, audio_norm, spk, uv
+
+    def __getitem__(self, index):
+        if self.all_in_mem:
+            return self.random_slice(*self.cache[index])
+        else:
+            return self.random_slice(*self.get_audio(self.audiopaths[index][0]))
+
+    def __len__(self):
+        return len(self.audiopaths)
+
+
+class TextAudioCollate:
+
+    def __call__(self, batch):
+        batch = [b for b in batch if b is not None]
+
+        # Sort items by content length, longest first.
+        input_lengths, ids_sorted_decreasing = torch.sort(
+            torch.LongTensor([x[0].shape[1] for x in batch]),
+            dim=0, descending=True)
+
+        max_c_len = max([x[0].size(1) for x in batch])
+        max_wav_len = max([x[3].size(1) for x in batch])
+
+        lengths = torch.LongTensor(len(batch))
+
+        c_padded = torch.FloatTensor(len(batch), batch[0][0].shape[0], max_c_len)
+        f0_padded = torch.FloatTensor(len(batch), max_c_len)
+        spec_padded = torch.FloatTensor(len(batch), batch[0][2].shape[0], max_c_len)
+        wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len)
+        spkids = torch.LongTensor(len(batch), 1)
+        uv_padded = torch.FloatTensor(len(batch), max_c_len)
+
+        c_padded.zero_()
+        spec_padded.zero_()
+        f0_padded.zero_()
+        wav_padded.zero_()
+        uv_padded.zero_()
+
+        # Copy each item into the zero-initialized buffers; the tail stays zero-padded.
+        for i in range(len(ids_sorted_decreasing)):
+            row = batch[ids_sorted_decreasing[i]]
+
+            c = row[0]
+            c_padded[i, :, :c.size(1)] = c
+            lengths[i] = c.size(1)
+
+            f0 = row[1]
+            f0_padded[i, :f0.size(0)] = f0
+
+            spec = row[2]
+            spec_padded[i, :, :spec.size(1)] = spec
+
+            wav = row[3]
+            wav_padded[i, :, :wav.size(1)] = wav
+
+            spkids[i, 0] = row[4]
+
+            uv = row[5]
+            uv_padded[i, :uv.size(0)] = uv
+
+        return c_padded, f0_padded, spec_padded, wav_padded, spkids, lengths, uv_padded
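`TextAudioCollate` above sorts a batch by decreasing content length and copies every item into zero-initialized buffers, so shorter items end up zero-padded on the right. The following is a minimal sketch of that pattern for a single field. It uses NumPy arrays instead of torch tensors to stay dependency-light, and the name `pad_batch` is hypothetical.

```python
import numpy as np


def pad_batch(items):
    """Zero-pad a list of [channels, time] arrays to a common length.

    Returns (padded [batch, channels, max_time], lengths [batch]),
    ordered from the longest item to the shortest.
    """
    order = sorted(range(len(items)), key=lambda i: items[i].shape[1], reverse=True)
    max_len = items[order[0]].shape[1]
    padded = np.zeros((len(items), items[0].shape[0], max_len), dtype=np.float32)
    lengths = np.zeros(len(items), dtype=np.int64)
    for row, idx in enumerate(order):
        item = items[idx]
        padded[row, :, : item.shape[1]] = item  # the remaining tail stays zero
        lengths[row] = item.shape[1]
    return padded, lengths


if __name__ == "__main__":
    batch = [np.ones((2, 3)), np.ones((2, 5)), np.ones((2, 4))]
    padded, lengths = pad_batch(batch)
    print(padded.shape, lengths.tolist())
```

Returning the true lengths alongside the padded tensor is what lets downstream code mask out the padded frames.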
diff --git a/so-vits-svc/dataset_raw/wav_structure.txt b/so-vits-svc/dataset_raw/wav_structure.txt
new file mode 100644
index 0000000000000000000000000000000000000000..68cee4e98b3512989e01945f600fc276e21637e0
--- /dev/null
+++ b/so-vits-svc/dataset_raw/wav_structure.txt
@@ -0,0 +1,20 @@
+Dataset preparation
+
+raw
+├───speaker0
+│ ├───xxx1-xxx1.wav
+│ ├───...
+│ └───Lxx-0xx8.wav
+└───speaker1
+ ├───xx2-0xxx2.wav
+ ├───...
+ └───xxx7-xxx007.wav
+
+You also need to edit config.json
+
+"n_speakers": 10
+
+"spk":{
+ "speaker0": 0,
+ "speaker1": 1,
+}
diff --git a/so-vits-svc/filelists/test.txt b/so-vits-svc/filelists/test.txt
new file mode 100644
index 0000000000000000000000000000000000000000..be640cffb48b3bc39126f9d1b83a3c992fe6e30d
--- /dev/null
+++ b/so-vits-svc/filelists/test.txt
@@ -0,0 +1,4 @@
+./dataset/44k/taffy/000562.wav
+./dataset/44k/nyaru/000011.wav
+./dataset/44k/nyaru/000008.wav
+./dataset/44k/taffy/000563.wav
diff --git a/so-vits-svc/filelists/train.txt b/so-vits-svc/filelists/train.txt
new file mode 100644
index 0000000000000000000000000000000000000000..acdb3ccec870a72f0d4da413e6aea97b36331f03
--- /dev/null
+++ b/so-vits-svc/filelists/train.txt
@@ -0,0 +1,15 @@
+./dataset/44k/taffy/000549.wav
+./dataset/44k/nyaru/000004.wav
+./dataset/44k/nyaru/000006.wav
+./dataset/44k/taffy/000551.wav
+./dataset/44k/nyaru/000009.wav
+./dataset/44k/taffy/000561.wav
+./dataset/44k/nyaru/000001.wav
+./dataset/44k/taffy/000553.wav
+./dataset/44k/nyaru/000002.wav
+./dataset/44k/taffy/000560.wav
+./dataset/44k/taffy/000557.wav
+./dataset/44k/nyaru/000005.wav
+./dataset/44k/taffy/000554.wav
+./dataset/44k/taffy/000550.wav
+./dataset/44k/taffy/000559.wav
diff --git a/so-vits-svc/filelists/val.txt b/so-vits-svc/filelists/val.txt
new file mode 100644
index 0000000000000000000000000000000000000000..262dfc97ec1ec3671138954a5c1490add8875b5b
--- /dev/null
+++ b/so-vits-svc/filelists/val.txt
@@ -0,0 +1,4 @@
+./dataset/44k/nyaru/000003.wav
+./dataset/44k/nyaru/000007.wav
+./dataset/44k/taffy/000558.wav
+./dataset/44k/taffy/000556.wav
diff --git a/so-vits-svc/flask_api.py b/so-vits-svc/flask_api.py
new file mode 100644
index 0000000000000000000000000000000000000000..dff87134620d6ec00e6c8950ccf6313946216af8
--- /dev/null
+++ b/so-vits-svc/flask_api.py
@@ -0,0 +1,62 @@
+import io
+import logging
+
+import soundfile
+import torch
+import torchaudio
+from flask import Flask, request, send_file
+from flask_cors import CORS
+
+from inference.infer_tool import Svc, RealTimeVC
+
+app = Flask(__name__)
+
+CORS(app)
+
+logging.getLogger('numba').setLevel(logging.WARNING)
+
+
+@app.route("/voiceChangeModel", methods=["POST"])
+def voice_change_model():
+ request_form = request.form
+ wave_file = request.files.get("sample", None)
+ # pitch changing information
+ f_pitch_change = float(request_form.get("fPitchChange", 0))
+ # DAW required sampling rate
+ daw_sample = int(float(request_form.get("sampleRate", 0)))
+ speaker_id = int(float(request_form.get("sSpeakId", 0)))
+ # get wav from http and convert
+ input_wav_path = io.BytesIO(wave_file.read())
+
+ # inference
+ if raw_infer:
+ # out_audio, out_sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path)
+ out_audio, out_sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path, cluster_infer_ratio=0,
+ auto_predict_f0=False, noice_scale=0.4, f0_filter=False)
+ tar_audio = torchaudio.functional.resample(out_audio, svc_model.target_sample, daw_sample)
+ else:
+ out_audio = svc.process(svc_model, speaker_id, f_pitch_change, input_wav_path, cluster_infer_ratio=0,
+ auto_predict_f0=False, noice_scale=0.4, f0_filter=False)
+ tar_audio = torchaudio.functional.resample(torch.from_numpy(out_audio), svc_model.target_sample, daw_sample)
+ # return
+ out_wav_path = io.BytesIO()
+ soundfile.write(out_wav_path, tar_audio.cpu().numpy(), daw_sample, format="wav")
+ out_wav_path.seek(0)
+ return send_file(out_wav_path, download_name="temp.wav", as_attachment=True)
+
+
+if __name__ == '__main__':
+ # True means chunks are spliced directly; there may be popping sounds at the splice points.
+ # False means chunks are cross-faded; there may be slight overlapping sounds at the splice points.
+ # Using 0.3-0.5s in the VST plugin can reduce latency.
+ # You can set the VST plugin's maximum slicing time to 1 second and set this to True to get stable sound quality at the cost of a relatively large delay.
+ # Choose whichever trade-off is acceptable to you.
+ raw_infer = True
+ # the model and the config must correspond to each other
+ model_name = "logs/32k/G_174000-Copy1.pth"
+ config_name = "configs/config.json"
+ cluster_model_path = "logs/44k/kmeans_10000.pt"
+ svc_model = Svc(model_name, config_name, cluster_model_path=cluster_model_path)
+ svc = RealTimeVC()
+ # must correspond to the settings used by the VST plugin
+ app.run(port=6842, host="0.0.0.0", debug=False, threaded=False)
diff --git a/so-vits-svc/flask_api_full_song.py b/so-vits-svc/flask_api_full_song.py
new file mode 100644
index 0000000000000000000000000000000000000000..901cdd064acc5c18a6e353c7ce390c0d39e850ac
--- /dev/null
+++ b/so-vits-svc/flask_api_full_song.py
@@ -0,0 +1,55 @@
+import io
+import numpy as np
+import soundfile
+from flask import Flask, request, send_file
+
+from inference import infer_tool
+from inference import slicer
+
+app = Flask(__name__)
+
+
+@app.route("/wav2wav", methods=["POST"])
+def wav2wav():
+ request_form = request.form
+ audio_path = request_form.get("audio_path", None) # wav path
+ tran = int(float(request_form.get("tran", 0))) # pitch shift in semitones
+ spk = request_form.get("spk", 0) # speaker(id or name)
+ wav_format = request_form.get("wav_format", 'wav')
+ infer_tool.format_wav(audio_path)
+ chunks = slicer.cut(audio_path, db_thresh=-40)
+ audio_data, audio_sr = slicer.chunks2audio(audio_path, chunks)
+
+ audio = []
+ for (slice_tag, data) in audio_data:
+ print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
+
+ length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
+ if slice_tag:
+ print('skip empty segment')
+ _audio = np.zeros(length)
+ else:
+ # pad with silence on both sides
+ pad_len = int(audio_sr * 0.5)
+ data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
+ raw_path = io.BytesIO()
+ soundfile.write(raw_path, data, audio_sr, format="wav")
+ raw_path.seek(0)
+ out_audio, out_sr = svc_model.infer(spk, tran, raw_path)
+ svc_model.clear_empty()
+ _audio = out_audio.cpu().numpy()
+ pad_len = int(svc_model.target_sample * 0.5)
+ _audio = _audio[pad_len:-pad_len]
+
+ audio.extend(list(infer_tool.pad_array(_audio, length)))
+ out_wav_path = io.BytesIO()
+ soundfile.write(out_wav_path, audio, svc_model.target_sample, format=wav_format)
+ out_wav_path.seek(0)
+ return send_file(out_wav_path, download_name=f"temp.{wav_format}", as_attachment=True)
+
+
+if __name__ == '__main__':
+ model_name = "logs/44k/G_60000.pth"
+ config_name = "configs/config.json"
+ svc_model = infer_tool.Svc(model_name, config_name)
+ app.run(port=1145, host="0.0.0.0", debug=False, threaded=False)
diff --git a/so-vits-svc/hubert/__init__.py b/so-vits-svc/hubert/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/hubert/hubert_model.py b/so-vits-svc/hubert/hubert_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..7fb642d89b07ca60792debab18e3454f52d8f357
--- /dev/null
+++ b/so-vits-svc/hubert/hubert_model.py
@@ -0,0 +1,222 @@
+import copy
+import random
+from typing import Optional, Tuple
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as t_func
+from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
+
+
+class Hubert(nn.Module):
+ def __init__(self, num_label_embeddings: int = 100, mask: bool = True):
+ super().__init__()
+ self._mask = mask
+ self.feature_extractor = FeatureExtractor()
+ self.feature_projection = FeatureProjection()
+ self.positional_embedding = PositionalConvEmbedding()
+ self.norm = nn.LayerNorm(768)
+ self.dropout = nn.Dropout(0.1)
+ self.encoder = TransformerEncoder(
+ nn.TransformerEncoderLayer(
+ 768, 12, 3072, activation="gelu", batch_first=True
+ ),
+ 12,
+ )
+ self.proj = nn.Linear(768, 256)
+
+ self.masked_spec_embed = nn.Parameter(torch.FloatTensor(768).uniform_())
+ self.label_embedding = nn.Embedding(num_label_embeddings, 256)
+
+ def mask(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ mask = None
+ if self.training and self._mask:
+ mask = _compute_mask((x.size(0), x.size(1)), 0.8, 10, x.device, 2)
+ x[mask] = self.masked_spec_embed.to(x.dtype)
+ return x, mask
+
+ def encode(
+ self, x: torch.Tensor, layer: Optional[int] = None
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+ x = self.feature_extractor(x)
+ x = self.feature_projection(x.transpose(1, 2))
+ x, mask = self.mask(x)
+ x = x + self.positional_embedding(x)
+ x = self.dropout(self.norm(x))
+ x = self.encoder(x, output_layer=layer)
+ return x, mask
+
+ def logits(self, x: torch.Tensor) -> torch.Tensor:
+ logits = torch.cosine_similarity(
+ x.unsqueeze(2),
+ self.label_embedding.weight.unsqueeze(0).unsqueeze(0),
+ dim=-1,
+ )
+ return logits / 0.1
+
+ def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ x, mask = self.encode(x)
+ x = self.proj(x)
+ logits = self.logits(x)
+ return logits, mask
+
+
+class HubertSoft(Hubert):
+ def __init__(self):
+ super().__init__()
+
+ @torch.inference_mode()
+ def units(self, wav: torch.Tensor) -> torch.Tensor:
+ wav = t_func.pad(wav, ((400 - 320) // 2, (400 - 320) // 2))
+ x, _ = self.encode(wav)
+ return self.proj(x)
+
+
+class FeatureExtractor(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.conv0 = nn.Conv1d(1, 512, 10, 5, bias=False)
+ self.norm0 = nn.GroupNorm(512, 512)
+ self.conv1 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv2 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv3 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv4 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv5 = nn.Conv1d(512, 512, 2, 2, bias=False)
+ self.conv6 = nn.Conv1d(512, 512, 2, 2, bias=False)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = t_func.gelu(self.norm0(self.conv0(x)))
+ x = t_func.gelu(self.conv1(x))
+ x = t_func.gelu(self.conv2(x))
+ x = t_func.gelu(self.conv3(x))
+ x = t_func.gelu(self.conv4(x))
+ x = t_func.gelu(self.conv5(x))
+ x = t_func.gelu(self.conv6(x))
+ return x
+
+
+class FeatureProjection(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.norm = nn.LayerNorm(512)
+ self.projection = nn.Linear(512, 768)
+ self.dropout = nn.Dropout(0.1)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = self.norm(x)
+ x = self.projection(x)
+ x = self.dropout(x)
+ return x
+
+
+class PositionalConvEmbedding(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.conv = nn.Conv1d(
+ 768,
+ 768,
+ kernel_size=128,
+ padding=128 // 2,
+ groups=16,
+ )
+ self.conv = nn.utils.weight_norm(self.conv, name="weight", dim=2)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = self.conv(x.transpose(1, 2))
+ x = t_func.gelu(x[:, :, :-1])
+ return x.transpose(1, 2)
+
+
+class TransformerEncoder(nn.Module):
+ def __init__(
+ self, encoder_layer: nn.TransformerEncoderLayer, num_layers: int
+ ) -> None:
+ super(TransformerEncoder, self).__init__()
+ self.layers = nn.ModuleList(
+ [copy.deepcopy(encoder_layer) for _ in range(num_layers)]
+ )
+ self.num_layers = num_layers
+
+ def forward(
+ self,
+ src: torch.Tensor,
+ mask: torch.Tensor = None,
+ src_key_padding_mask: torch.Tensor = None,
+ output_layer: Optional[int] = None,
+ ) -> torch.Tensor:
+ output = src
+ for layer in self.layers[:output_layer]:
+ output = layer(
+ output, src_mask=mask, src_key_padding_mask=src_key_padding_mask
+ )
+ return output
+
+
+def _compute_mask(
+ shape: Tuple[int, int],
+ mask_prob: float,
+ mask_length: int,
+ device: torch.device,
+ min_masks: int = 0,
+) -> torch.Tensor:
+ batch_size, sequence_length = shape
+
+ if mask_length < 1:
+ raise ValueError("`mask_length` has to be bigger than 0.")
+
+ if mask_length > sequence_length:
+ raise ValueError(
+ f"`mask_length` must not exceed `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}"
+ )
+
+ # compute number of masked spans in batch
+ num_masked_spans = int(mask_prob * sequence_length / mask_length + random.random())
+ num_masked_spans = max(num_masked_spans, min_masks)
+
+ # make sure num masked indices <= sequence_length
+ if num_masked_spans * mask_length > sequence_length:
+ num_masked_spans = sequence_length // mask_length
+
+ # SpecAugment mask to fill
+ mask = torch.zeros((batch_size, sequence_length), device=device, dtype=torch.bool)
+
+ # uniform distribution to sample from, make sure that offset samples are < sequence_length
+ uniform_dist = torch.ones(
+ (batch_size, sequence_length - (mask_length - 1)), device=device
+ )
+
+ # get random indices to mask
+ mask_indices = torch.multinomial(uniform_dist, num_masked_spans)
+
+ # expand masked indices to masked spans
+ mask_indices = (
+ mask_indices.unsqueeze(dim=-1)
+ .expand((batch_size, num_masked_spans, mask_length))
+ .reshape(batch_size, num_masked_spans * mask_length)
+ )
+ offsets = (
+ torch.arange(mask_length, device=device)[None, None, :]
+ .expand((batch_size, num_masked_spans, mask_length))
+ .reshape(batch_size, num_masked_spans * mask_length)
+ )
+ mask_idxs = mask_indices + offsets
+
+ # scatter indices to mask
+ mask = mask.scatter(1, mask_idxs, True)
+
+ return mask
+
+
+def hubert_soft(
+ path: str,
+) -> HubertSoft:
+ r"""HuBERT-Soft from `"A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion"`.
+ Args:
+ path (str): path of a pretrained model
+ """
+ hubert = HubertSoft()
+ checkpoint = torch.load(path)
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
+ hubert.load_state_dict(checkpoint)
+ hubert.eval()
+ return hubert
diff --git a/so-vits-svc/hubert/hubert_model_onnx.py b/so-vits-svc/hubert/hubert_model_onnx.py
new file mode 100644
index 0000000000000000000000000000000000000000..d18f3c2a0fc29592a573a9780308d38f059640b9
--- /dev/null
+++ b/so-vits-svc/hubert/hubert_model_onnx.py
@@ -0,0 +1,217 @@
+import copy
+import random
+from typing import Optional, Tuple
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as t_func
+from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
+
+
+class Hubert(nn.Module):
+ def __init__(self, num_label_embeddings: int = 100, mask: bool = True):
+ super().__init__()
+ self._mask = mask
+ self.feature_extractor = FeatureExtractor()
+ self.feature_projection = FeatureProjection()
+ self.positional_embedding = PositionalConvEmbedding()
+ self.norm = nn.LayerNorm(768)
+ self.dropout = nn.Dropout(0.1)
+ self.encoder = TransformerEncoder(
+ nn.TransformerEncoderLayer(
+ 768, 12, 3072, activation="gelu", batch_first=True
+ ),
+ 12,
+ )
+ self.proj = nn.Linear(768, 256)
+
+ self.masked_spec_embed = nn.Parameter(torch.FloatTensor(768).uniform_())
+ self.label_embedding = nn.Embedding(num_label_embeddings, 256)
+
+ def mask(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+ mask = None
+ if self.training and self._mask:
+ mask = _compute_mask((x.size(0), x.size(1)), 0.8, 10, x.device, 2)
+ x[mask] = self.masked_spec_embed.to(x.dtype)
+ return x, mask
+
+ def encode(
+ self, x: torch.Tensor, layer: Optional[int] = None
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+ x = self.feature_extractor(x)
+ x = self.feature_projection(x.transpose(1, 2))
+ x, mask = self.mask(x)
+ x = x + self.positional_embedding(x)
+ x = self.dropout(self.norm(x))
+ x = self.encoder(x, output_layer=layer)
+ return x, mask
+
+ def logits(self, x: torch.Tensor) -> torch.Tensor:
+ logits = torch.cosine_similarity(
+ x.unsqueeze(2),
+ self.label_embedding.weight.unsqueeze(0).unsqueeze(0),
+ dim=-1,
+ )
+ return logits / 0.1
+
+
+class HubertSoft(Hubert):
+ def __init__(self):
+ super().__init__()
+
+ def units(self, wav: torch.Tensor) -> torch.Tensor:
+ wav = t_func.pad(wav, ((400 - 320) // 2, (400 - 320) // 2))
+ x, _ = self.encode(wav)
+ return self.proj(x)
+
+ def forward(self, x):
+ return self.units(x)
+
+class FeatureExtractor(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.conv0 = nn.Conv1d(1, 512, 10, 5, bias=False)
+ self.norm0 = nn.GroupNorm(512, 512)
+ self.conv1 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv2 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv3 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv4 = nn.Conv1d(512, 512, 3, 2, bias=False)
+ self.conv5 = nn.Conv1d(512, 512, 2, 2, bias=False)
+ self.conv6 = nn.Conv1d(512, 512, 2, 2, bias=False)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = t_func.gelu(self.norm0(self.conv0(x)))
+ x = t_func.gelu(self.conv1(x))
+ x = t_func.gelu(self.conv2(x))
+ x = t_func.gelu(self.conv3(x))
+ x = t_func.gelu(self.conv4(x))
+ x = t_func.gelu(self.conv5(x))
+ x = t_func.gelu(self.conv6(x))
+ return x
+
+
+class FeatureProjection(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.norm = nn.LayerNorm(512)
+ self.projection = nn.Linear(512, 768)
+ self.dropout = nn.Dropout(0.1)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = self.norm(x)
+ x = self.projection(x)
+ x = self.dropout(x)
+ return x
+
+
+class PositionalConvEmbedding(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.conv = nn.Conv1d(
+ 768,
+ 768,
+ kernel_size=128,
+ padding=128 // 2,
+ groups=16,
+ )
+ self.conv = nn.utils.weight_norm(self.conv, name="weight", dim=2)
+
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
+ x = self.conv(x.transpose(1, 2))
+ x = t_func.gelu(x[:, :, :-1])
+ return x.transpose(1, 2)
+
+
+class TransformerEncoder(nn.Module):
+ def __init__(
+ self, encoder_layer: nn.TransformerEncoderLayer, num_layers: int
+ ) -> None:
+ super(TransformerEncoder, self).__init__()
+ self.layers = nn.ModuleList(
+ [copy.deepcopy(encoder_layer) for _ in range(num_layers)]
+ )
+ self.num_layers = num_layers
+
+ def forward(
+ self,
+ src: torch.Tensor,
+ mask: torch.Tensor = None,
+ src_key_padding_mask: torch.Tensor = None,
+ output_layer: Optional[int] = None,
+ ) -> torch.Tensor:
+ output = src
+ for layer in self.layers[:output_layer]:
+ output = layer(
+ output, src_mask=mask, src_key_padding_mask=src_key_padding_mask
+ )
+ return output
+
+
+def _compute_mask(
+ shape: Tuple[int, int],
+ mask_prob: float,
+ mask_length: int,
+ device: torch.device,
+ min_masks: int = 0,
+) -> torch.Tensor:
+ batch_size, sequence_length = shape
+
+ if mask_length < 1:
+ raise ValueError("`mask_length` has to be bigger than 0.")
+
+ if mask_length > sequence_length:
+ raise ValueError(
+ f"`mask_length` must not exceed `sequence_length`, but got `mask_length`: {mask_length} and `sequence_length`: {sequence_length}"
+ )
+
+ # compute number of masked spans in batch
+ num_masked_spans = int(mask_prob * sequence_length / mask_length + random.random())
+ num_masked_spans = max(num_masked_spans, min_masks)
+
+ # make sure num masked indices <= sequence_length
+ if num_masked_spans * mask_length > sequence_length:
+ num_masked_spans = sequence_length // mask_length
+
+ # SpecAugment mask to fill
+ mask = torch.zeros((batch_size, sequence_length), device=device, dtype=torch.bool)
+
+ # uniform distribution to sample from, make sure that offset samples are < sequence_length
+ uniform_dist = torch.ones(
+ (batch_size, sequence_length - (mask_length - 1)), device=device
+ )
+
+ # get random indices to mask
+ mask_indices = torch.multinomial(uniform_dist, num_masked_spans)
+
+ # expand masked indices to masked spans
+ mask_indices = (
+ mask_indices.unsqueeze(dim=-1)
+ .expand((batch_size, num_masked_spans, mask_length))
+ .reshape(batch_size, num_masked_spans * mask_length)
+ )
+ offsets = (
+ torch.arange(mask_length, device=device)[None, None, :]
+ .expand((batch_size, num_masked_spans, mask_length))
+ .reshape(batch_size, num_masked_spans * mask_length)
+ )
+ mask_idxs = mask_indices + offsets
+
+ # scatter indices to mask
+ mask = mask.scatter(1, mask_idxs, True)
+
+ return mask
+
+
+def hubert_soft(
+ path: str,
+) -> HubertSoft:
+ r"""HuBERT-Soft from `"A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion"`.
+ Args:
+ path (str): path of a pretrained model
+ """
+ hubert = HubertSoft()
+ checkpoint = torch.load(path)
+ consume_prefix_in_state_dict_if_present(checkpoint, "module.")
+ hubert.load_state_dict(checkpoint)
+ hubert.eval()
+ return hubert
diff --git a/so-vits-svc/hubert/put_hubert_ckpt_here b/so-vits-svc/hubert/put_hubert_ckpt_here
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/inference/__init__.py b/so-vits-svc/inference/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/inference/infer_tool.py b/so-vits-svc/inference/infer_tool.py
new file mode 100644
index 0000000000000000000000000000000000000000..91561cfbfc61f3bf7334b10e8e7242574c5ed061
--- /dev/null
+++ b/so-vits-svc/inference/infer_tool.py
@@ -0,0 +1,354 @@
+import hashlib
+import io
+import json
+import logging
+import os
+import time
+from pathlib import Path
+from inference import slicer
+import gc
+
+import librosa
+import numpy as np
+# import onnxruntime
+import parselmouth
+import soundfile
+import torch
+import torchaudio
+
+import cluster
+from hubert import hubert_model
+import utils
+from models import SynthesizerTrn
+
+logging.getLogger('matplotlib').setLevel(logging.WARNING)
+
+
+def read_temp(file_name):
+ if not os.path.exists(file_name):
+ with open(file_name, "w") as f:
+ f.write(json.dumps({"info": "temp_dict"}))
+ return {}
+ else:
+ try:
+ with open(file_name, "r") as f:
+ data = f.read()
+ data_dict = json.loads(data)
+ if os.path.getsize(file_name) > 50 * 1024 * 1024:
+ f_name = file_name.replace("\\", "/").split("/")[-1]
+ print(f"clean {f_name}")
+ for wav_hash in list(data_dict.keys()):
+ if int(time.time()) - int(data_dict[wav_hash]["time"]) > 14 * 24 * 3600:
+ del data_dict[wav_hash]
+ except Exception as e:
+ print(e)
+ print(f"{file_name} error, rebuilding file automatically")
+ data_dict = {"info": "temp_dict"}
+ return data_dict
+
+
+def write_temp(file_name, data):
+ with open(file_name, "w") as f:
+ f.write(json.dumps(data))
+
+
+def timeit(func):
+ def run(*args, **kwargs):
+ t = time.time()
+ res = func(*args, **kwargs)
+ print('executing \'%s\' took %.3fs' % (func.__name__, time.time() - t))
+ return res
+
+ return run
+
+
+def format_wav(audio_path):
+ if Path(audio_path).suffix == '.wav':
+ return
+ raw_audio, raw_sample_rate = librosa.load(audio_path, mono=True, sr=None)
+ soundfile.write(Path(audio_path).with_suffix(".wav"), raw_audio, raw_sample_rate)
+
+
+def get_end_file(dir_path, end):
+ file_lists = []
+ for root, dirs, files in os.walk(dir_path):
+ files = [f for f in files if f[0] != '.']
+ dirs[:] = [d for d in dirs if d[0] != '.']
+ for f_file in files:
+ if f_file.endswith(end):
+ file_lists.append(os.path.join(root, f_file).replace("\\", "/"))
+ return file_lists
+
+
+def get_md5(content):
+ return hashlib.new("md5", content).hexdigest()
+
+def fill_a_to_b(a, b):
+ if len(a) < len(b):
+ for _ in range(0, len(b) - len(a)):
+ a.append(a[0])
+
+def mkdir(paths: list):
+ for path in paths:
+ if not os.path.exists(path):
+ os.mkdir(path)
+
+def pad_array(arr, target_length):
+ current_length = arr.shape[0]
+ if current_length >= target_length:
+ return arr
+ else:
+ pad_width = target_length - current_length
+ pad_left = pad_width // 2
+ pad_right = pad_width - pad_left
+ padded_arr = np.pad(arr, (pad_left, pad_right), 'constant', constant_values=(0, 0))
+ return padded_arr
+
+def split_list_by_n(list_collection, n, pre=0):
+ for i in range(0, len(list_collection), n):
+ yield list_collection[i-pre if i-pre>=0 else i: i + n]
+
+
+class F0FilterException(Exception):
+ pass
+
+class Svc(object):
+ def __init__(self, net_g_path, config_path,
+ device=None,
+ cluster_model_path="logs/44k/kmeans_10000.pt",
+ nsf_hifigan_enhance = False
+ ):
+ self.net_g_path = net_g_path
+ if device is None:
+ self.dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ else:
+ self.dev = torch.device(device)
+ self.net_g_ms = None
+ self.hps_ms = utils.get_hparams_from_file(config_path)
+ self.target_sample = self.hps_ms.data.sampling_rate
+ self.hop_size = self.hps_ms.data.hop_length
+ self.spk2id = self.hps_ms.spk
+ self.nsf_hifigan_enhance = nsf_hifigan_enhance
+ # load hubert
+ self.hubert_model = utils.get_hubert_model().to(self.dev)
+ self.load_model()
+ if os.path.exists(cluster_model_path):
+ self.cluster_model = cluster.get_cluster_model(cluster_model_path)
+ if self.nsf_hifigan_enhance:
+ from modules.enhancer import Enhancer
+ self.enhancer = Enhancer('nsf-hifigan', 'pretrain/nsf_hifigan/model',device=self.dev)
+
+ def load_model(self):
+ # get model configuration
+ self.net_g_ms = SynthesizerTrn(
+ self.hps_ms.data.filter_length // 2 + 1,
+ self.hps_ms.train.segment_size // self.hps_ms.data.hop_length,
+ **self.hps_ms.model)
+ _ = utils.load_checkpoint(self.net_g_path, self.net_g_ms, None)
+ if "half" in self.net_g_path and torch.cuda.is_available():
+ _ = self.net_g_ms.half().eval().to(self.dev)
+ else:
+ _ = self.net_g_ms.eval().to(self.dev)
+
+
+
+ def get_unit_f0(self, in_path, tran, cluster_infer_ratio, speaker, f0_filter ,F0_mean_pooling,cr_threshold=0.05):
+
+ wav, sr = librosa.load(in_path, sr=self.target_sample)
+
+ if F0_mean_pooling == True:
+ f0, uv = utils.compute_f0_uv_torchcrepe(torch.FloatTensor(wav), sampling_rate=self.target_sample, hop_length=self.hop_size,device=self.dev,cr_threshold = cr_threshold)
+ if f0_filter and sum(f0) == 0:
+ raise F0FilterException("No voice detected")
+ f0 = torch.FloatTensor(list(f0))
+ uv = torch.FloatTensor(list(uv))
+ if F0_mean_pooling == False:
+ f0 = utils.compute_f0_parselmouth(wav, sampling_rate=self.target_sample, hop_length=self.hop_size)
+ if f0_filter and sum(f0) == 0:
+ raise F0FilterException("No voice detected")
+ f0, uv = utils.interpolate_f0(f0)
+ f0 = torch.FloatTensor(f0)
+ uv = torch.FloatTensor(uv)
+
+ f0 = f0 * 2 ** (tran / 12)
+ f0 = f0.unsqueeze(0).to(self.dev)
+ uv = uv.unsqueeze(0).to(self.dev)
+
+ wav16k = librosa.resample(wav, orig_sr=self.target_sample, target_sr=16000)
+ wav16k = torch.from_numpy(wav16k).to(self.dev)
+ c = utils.get_hubert_content(self.hubert_model, wav_16k_tensor=wav16k)
+ c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[1])
+
+ if cluster_infer_ratio !=0:
+ cluster_c = cluster.get_cluster_center_result(self.cluster_model, c.cpu().numpy().T, speaker).T
+ cluster_c = torch.FloatTensor(cluster_c).to(self.dev)
+ c = cluster_infer_ratio * cluster_c + (1 - cluster_infer_ratio) * c
+
+ c = c.unsqueeze(0)
+ return c, f0, uv
+
+ def infer(self, speaker, tran, raw_path,
+ cluster_infer_ratio=0,
+ auto_predict_f0=False,
+ noice_scale=0.4,
+ f0_filter=False,
+ F0_mean_pooling=False,
+ enhancer_adaptive_key = 0,
+ cr_threshold = 0.05
+ ):
+
+ speaker_id = self.spk2id.__dict__.get(speaker)
+ if not speaker_id and type(speaker) is int:
+ if len(self.spk2id.__dict__) >= speaker:
+ speaker_id = speaker
+ sid = torch.LongTensor([int(speaker_id)]).to(self.dev).unsqueeze(0)
+ c, f0, uv = self.get_unit_f0(raw_path, tran, cluster_infer_ratio, speaker, f0_filter,F0_mean_pooling,cr_threshold=cr_threshold)
+ if "half" in self.net_g_path and torch.cuda.is_available():
+ c = c.half()
+ with torch.no_grad():
+ start = time.time()
+ audio = self.net_g_ms.infer(c, f0=f0, g=sid, uv=uv, predict_f0=auto_predict_f0, noice_scale=noice_scale)[0,0].data.float()
+ if self.nsf_hifigan_enhance:
+ audio, _ = self.enhancer.enhance(
+ audio[None,:],
+ self.target_sample,
+ f0[:,:,None],
+ self.hps_ms.data.hop_length,
+ adaptive_key = enhancer_adaptive_key)
+ use_time = time.time() - start
+ print("vits use time:{}".format(use_time))
+ return audio, audio.shape[-1]
+
+ def clear_empty(self):
+ # clean up vram
+ torch.cuda.empty_cache()
+
+ def unload_model(self):
+ # unload model
+ self.net_g_ms = self.net_g_ms.to("cpu")
+ del self.net_g_ms
+ if hasattr(self,"enhancer"):
+ self.enhancer.enhancer = self.enhancer.enhancer.to("cpu")
+ del self.enhancer.enhancer
+ del self.enhancer
+ gc.collect()
+
+ def slice_inference(self,
+ raw_audio_path,
+ spk,
+ tran,
+ slice_db,
+ cluster_infer_ratio,
+ auto_predict_f0,
+ noice_scale,
+ pad_seconds=0.5,
+ clip_seconds=0,
+ lg_num=0,
+ lgr_num =0.75,
+ F0_mean_pooling = False,
+ enhancer_adaptive_key = 0,
+ cr_threshold = 0.05
+ ):
+ wav_path = raw_audio_path
+ chunks = slicer.cut(wav_path, db_thresh=slice_db)
+ audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
+ per_size = int(clip_seconds*audio_sr)
+ lg_size = int(lg_num*audio_sr)
+ lg_size_r = int(lg_size*lgr_num)
+ lg_size_c_l = (lg_size-lg_size_r)//2
+ lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
+ lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
+
+ audio = []
+ for (slice_tag, data) in audio_data:
+ print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
+ # output length at the target sample rate
+ length = int(np.ceil(len(data) / audio_sr * self.target_sample))
+ if slice_tag:
+ print('skip empty segment')
+ _audio = np.zeros(length)
+ audio.extend(list(pad_array(_audio, length)))
+ continue
+ if per_size != 0:
+ datas = split_list_by_n(data, per_size,lg_size)
+ else:
+ datas = [data]
+ for k,dat in enumerate(datas):
+ per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds!=0 else length
+ if clip_seconds!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
+ # pad with silence on both sides
+ pad_len = int(audio_sr * pad_seconds)
+ dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
+ raw_path = io.BytesIO()
+ soundfile.write(raw_path, dat, audio_sr, format="wav")
+ raw_path.seek(0)
+ out_audio, out_sr = self.infer(spk, tran, raw_path,
+ cluster_infer_ratio=cluster_infer_ratio,
+ auto_predict_f0=auto_predict_f0,
+ noice_scale=noice_scale,
+ F0_mean_pooling = F0_mean_pooling,
+ enhancer_adaptive_key = enhancer_adaptive_key,
+ cr_threshold = cr_threshold
+ )
+ _audio = out_audio.cpu().numpy()
+ pad_len = int(self.target_sample * pad_seconds)
+ _audio = _audio[pad_len:-pad_len]
+ _audio = pad_array(_audio, per_length)
+ if lg_size!=0 and k!=0:
+ lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
+ lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
+ lg_pre = lg1*(1-lg)+lg2*lg
+ audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
+ audio.extend(lg_pre)
+ _audio = _audio[lg_size_c_l+lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
+ audio.extend(list(_audio))
+ return np.array(audio)
+
+class RealTimeVC:
+ def __init__(self):
+ self.last_chunk = None
+ self.last_o = None
+ self.chunk_len = 16000 # chunk length
+ self.pre_len = 3840 # cross fade length, multiples of 640
+
+ # Input and output are 1-dimensional numpy waveform arrays
+
+ def process(self, svc_model, speaker_id, f_pitch_change, input_wav_path,
+ cluster_infer_ratio=0,
+ auto_predict_f0=False,
+ noice_scale=0.4,
+ f0_filter=False):
+
+ import maad
+ audio, sr = torchaudio.load(input_wav_path)
+ audio = audio.cpu().numpy()[0]
+ temp_wav = io.BytesIO()
+ if self.last_chunk is None:
+ input_wav_path.seek(0)
+
+ audio, sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path,
+ cluster_infer_ratio=cluster_infer_ratio,
+ auto_predict_f0=auto_predict_f0,
+ noice_scale=noice_scale,
+ f0_filter=f0_filter)
+
+ audio = audio.cpu().numpy()
+ self.last_chunk = audio[-self.pre_len:]
+ self.last_o = audio
+ return audio[-self.chunk_len:]
+ else:
+ audio = np.concatenate([self.last_chunk, audio])
+ soundfile.write(temp_wav, audio, sr, format="wav")
+ temp_wav.seek(0)
+
+ audio, sr = svc_model.infer(speaker_id, f_pitch_change, temp_wav,
+ cluster_infer_ratio=cluster_infer_ratio,
+ auto_predict_f0=auto_predict_f0,
+ noice_scale=noice_scale,
+ f0_filter=f0_filter)
+
+ audio = audio.cpu().numpy()
+ ret = maad.util.crossfade(self.last_o, audio, self.pre_len)
+ self.last_chunk = audio[-self.pre_len:]
+ self.last_o = audio
+ return ret[self.chunk_len:2 * self.chunk_len]
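Review note on RealTimeVC.process: each call keeps the tail of the previous output and blends it into the next output through maad.util.crossfade. The sketch below illustrates the general idea with a plain linear cross-fade in pure Python. The helper name linear_crossfade and its weighting are assumptions made for illustration; this is not maad's implementation and may weight the overlap differently.

```python
def linear_crossfade(prev, nxt, overlap):
    """Join two sample sequences, blending the last `overlap` samples of
    `prev` with the first `overlap` samples of `nxt` (hypothetical helper)."""
    if overlap <= 0:
        return list(prev) + list(nxt)
    if overlap > len(prev) or overlap > len(nxt):
        raise ValueError("overlap is longer than one of the inputs")

    head = list(prev[:-overlap])   # untouched part of the previous chunk
    tail = list(nxt[overlap:])     # untouched part of the next chunk
    blended = []
    for i in range(overlap):
        # Weight ramps from 0 (all `prev`) to 1 (all `nxt`) across the overlap.
        w = i / (overlap - 1) if overlap > 1 else 0.5
        blended.append(prev[len(prev) - overlap + i] * (1 - w) + nxt[i] * w)
    return head + blended + tail
```

The output is shorter than the two inputs combined by `overlap` samples, which is why a streaming caller has to track how much of the joined signal it has already emitted.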
diff --git a/so-vits-svc/inference/infer_tool_grad.py b/so-vits-svc/inference/infer_tool_grad.py
new file mode 100644
index 0000000000000000000000000000000000000000..b75af49c08e2e724839828bc419792ed580809bb
--- /dev/null
+++ b/so-vits-svc/inference/infer_tool_grad.py
@@ -0,0 +1,160 @@
+import hashlib
+import json
+import logging
+import os
+import time
+from pathlib import Path
+import io
+import librosa
+import maad
+import numpy as np
+from inference import slicer
+import parselmouth
+import soundfile
+import torch
+import torchaudio
+
+from hubert import hubert_model
+import utils
+from models import SynthesizerTrn
+logging.getLogger('numba').setLevel(logging.WARNING)
+logging.getLogger('matplotlib').setLevel(logging.WARNING)
+
+def resize2d_f0(x, target_len):
+ source = np.array(x)
+ source[source < 0.001] = np.nan
+ target = np.interp(np.arange(0, len(source) * target_len, len(source)) / target_len, np.arange(0, len(source)),
+ source)
+ res = np.nan_to_num(target)
+ return res
+
+def get_f0(x, p_len,f0_up_key=0):
+
+ time_step = 160 / 16000 * 1000
+ f0_min = 50
+ f0_max = 1100
+ f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+ f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+
+ f0 = parselmouth.Sound(x, 16000).to_pitch_ac(
+ time_step=time_step / 1000, voicing_threshold=0.6,
+ pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency']
+
+ pad_size=(p_len - len(f0) + 1) // 2
+ if(pad_size>0 or p_len - len(f0) - pad_size>0):
+ f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant')
+
+ f0 *= pow(2, f0_up_key / 12)
+ f0_mel = 1127 * np.log(1 + f0 / 700)
+ f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1
+ f0_mel[f0_mel <= 1] = 1
+ f0_mel[f0_mel > 255] = 255
+ f0_coarse = np.rint(f0_mel).astype(int)
+ return f0_coarse, f0
+
+def clean_pitch(input_pitch):
+ num_nan = np.sum(input_pitch == 1)
+ if num_nan / len(input_pitch) > 0.9:
+ input_pitch[input_pitch != 1] = 1
+ return input_pitch
+
+
+def plt_pitch(input_pitch):
+ input_pitch = input_pitch.astype(float)
+ input_pitch[input_pitch == 1] = np.nan
+ return input_pitch
+
+
+def f0_to_pitch(ff):
+ f0_pitch = 69 + 12 * np.log2(ff / 440)
+ return f0_pitch
+
+
+def fill_a_to_b(a, b):
+ if len(a) < len(b):
+ for _ in range(0, len(b) - len(a)):
+ a.append(a[0])
+
+
+def mkdir(paths: list):
+ for path in paths:
+ if not os.path.exists(path):
+ os.mkdir(path)
+
+
+class VitsSvc(object):
+ def __init__(self):
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ self.SVCVITS = None
+ self.hps = None
+ self.speakers = None
+ self.hubert_soft = utils.get_hubert_model()
+
+ def set_device(self, device):
+ self.device = torch.device(device)
+ self.hubert_soft.to(self.device)
+ if self.SVCVITS is not None:
+ self.SVCVITS.to(self.device)
+
+ def loadCheckpoint(self, path):
+ self.hps = utils.get_hparams_from_file(f"checkpoints/{path}/config.json")
+ self.SVCVITS = SynthesizerTrn(
+ self.hps.data.filter_length // 2 + 1,
+ self.hps.train.segment_size // self.hps.data.hop_length,
+ **self.hps.model)
+ _ = utils.load_checkpoint(f"checkpoints/{path}/model.pth", self.SVCVITS, None)
+ _ = self.SVCVITS.eval().to(self.device)
+ self.speakers = self.hps.spk
+
+ def get_units(self, source, sr):
+ source = source.unsqueeze(0).to(self.device)
+ with torch.inference_mode():
+ units = self.hubert_soft.units(source)
+ return units
+
+
+ def get_unit_pitch(self, in_path, tran):
+ source, sr = torchaudio.load(in_path)
+ source = torchaudio.functional.resample(source, sr, 16000)
+ if len(source.shape) == 2 and source.shape[0] >= 2:  # torchaudio.load returns (channels, frames)
+ source = torch.mean(source, dim=0).unsqueeze(0)
+ soft = self.get_units(source, sr).squeeze(0).cpu().numpy()
+ f0_coarse, f0 = get_f0(source.cpu().numpy()[0], soft.shape[0]*2, tran)
+ return soft, f0
+
+ def infer(self, speaker_id, tran, raw_path):
+ speaker_id = self.speakers[speaker_id]
+ sid = torch.LongTensor([int(speaker_id)]).to(self.device).unsqueeze(0)
+ soft, pitch = self.get_unit_pitch(raw_path, tran)
+ f0 = torch.FloatTensor(clean_pitch(pitch)).unsqueeze(0).to(self.device)
+ stn_tst = torch.FloatTensor(soft)
+ with torch.no_grad():
+ x_tst = stn_tst.unsqueeze(0).to(self.device)
+ x_tst = torch.repeat_interleave(x_tst, repeats=2, dim=1).transpose(1, 2)
+ audio = self.SVCVITS.infer(x_tst, f0=f0, g=sid)[0,0].data.float()
+ return audio, audio.shape[-1]
+
+ def inference(self,srcaudio,chara,tran,slice_db):
+ sampling_rate, audio = srcaudio
+ audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32)
+ if len(audio.shape) > 1:
+ audio = librosa.to_mono(audio.transpose(1, 0))
+ if sampling_rate != 16000:
+ audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=16000)
+ soundfile.write("tmpwav.wav", audio, 16000, format="wav")
+ chunks = slicer.cut("tmpwav.wav", db_thresh=slice_db)
+ audio_data, audio_sr = slicer.chunks2audio("tmpwav.wav", chunks)
+ audio = []
+ for (slice_tag, data) in audio_data:
+ length = int(np.ceil(len(data) / audio_sr * self.hps.data.sampling_rate))
+ raw_path = io.BytesIO()
+ soundfile.write(raw_path, data, audio_sr, format="wav")
+ raw_path.seek(0)
+ if slice_tag:
+ _audio = np.zeros(length)
+ else:
+ out_audio, out_sr = self.infer(chara, tran, raw_path)
+ _audio = out_audio.cpu().numpy()
+ audio.extend(list(_audio))
+ audio = np.clip(np.array(audio) * 32768.0, -32768, 32767).astype('int16')
+ return (self.hps.data.sampling_rate,audio)
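Review note on get_f0 in infer_tool_grad.py: the function maps f0 in Hz onto a mel-like scale and then quantizes it into integer bins 1..255, with bin 1 covering unvoiced frames and anything at or below f0_min. The scalar sketch below reproduces that arithmetic with the standard library only, using the same constants as the diff (50 Hz, 1100 Hz, 1127, 700). The function names are hypothetical and the vectorized NumPy handling is deliberately left out.

```python
import math

F0_MIN = 50.0
F0_MAX = 1100.0


def hz_to_mel(f0_hz):
    # Same mel-style mapping as in get_f0.
    return 1127.0 * math.log(1.0 + f0_hz / 700.0)


MEL_MIN = hz_to_mel(F0_MIN)
MEL_MAX = hz_to_mel(F0_MAX)


def f0_to_coarse_bin(f0_hz):
    """Quantize a single f0 value (Hz) to an integer bin in [1, 255]."""
    mel = hz_to_mel(f0_hz)
    if mel > 0:
        # Rescale [MEL_MIN, MEL_MAX] onto [1, 255].
        mel = (mel - MEL_MIN) * 254.0 / (MEL_MAX - MEL_MIN) + 1.0
    # Clamp: unvoiced (0 Hz) and too-low pitches fall into bin 1.
    mel = min(max(mel, 1.0), 255.0)
    return int(round(mel))
```

Python's round() rounds halves to the nearest even integer, as np.rint does, so the scalar version should agree with the array version on ties.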
diff --git a/so-vits-svc/inference/slicer.py b/so-vits-svc/inference/slicer.py
new file mode 100644
index 0000000000000000000000000000000000000000..afb31b7af1cdf8310ea42968d1857af6f15d73e4
--- /dev/null
+++ b/so-vits-svc/inference/slicer.py
@@ -0,0 +1,142 @@
+import librosa
+import torch
+import torchaudio
+
+
+class Slicer:
+ def __init__(self,
+ sr: int,
+ threshold: float = -40.,
+ min_length: int = 5000,
+ min_interval: int = 300,
+ hop_size: int = 20,
+ max_sil_kept: int = 5000):
+ if not min_length >= min_interval >= hop_size:
+ raise ValueError('The following condition must be satisfied: min_length >= min_interval >= hop_size')
+ if not max_sil_kept >= hop_size:
+ raise ValueError('The following condition must be satisfied: max_sil_kept >= hop_size')
+ min_interval = sr * min_interval / 1000
+ self.threshold = 10 ** (threshold / 20.)
+ self.hop_size = round(sr * hop_size / 1000)
+ self.win_size = min(round(min_interval), 4 * self.hop_size)
+ self.min_length = round(sr * min_length / 1000 / self.hop_size)
+ self.min_interval = round(min_interval / self.hop_size)
+ self.max_sil_kept = round(sr * max_sil_kept / 1000 / self.hop_size)
+
+ def _apply_slice(self, waveform, begin, end):
+ if len(waveform.shape) > 1:
+ return waveform[:, begin * self.hop_size: min(waveform.shape[1], end * self.hop_size)]
+ else:
+ return waveform[begin * self.hop_size: min(waveform.shape[0], end * self.hop_size)]
+
+ # @timeit
+ def slice(self, waveform):
+ if len(waveform.shape) > 1:
+ samples = librosa.to_mono(waveform)
+ else:
+ samples = waveform
+ if samples.shape[0] <= self.min_length:
+ return {"0": {"slice": False, "split_time": f"0,{len(waveform)}"}}
+ rms_list = librosa.feature.rms(y=samples, frame_length=self.win_size, hop_length=self.hop_size).squeeze(0)
+ sil_tags = []
+ silence_start = None
+ clip_start = 0
+ for i, rms in enumerate(rms_list):
+ # Keep looping while frame is silent.
+ if rms < self.threshold:
+ # Record start of silent frames.
+ if silence_start is None:
+ silence_start = i
+ continue
+ # Keep looping while frame is not silent and silence start has not been recorded.
+ if silence_start is None:
+ continue
+ # Clear recorded silence start if interval is not enough or clip is too short
+ is_leading_silence = silence_start == 0 and i > self.max_sil_kept
+ need_slice_middle = i - silence_start >= self.min_interval and i - clip_start >= self.min_length
+ if not is_leading_silence and not need_slice_middle:
+ silence_start = None
+ continue
+ # Need slicing. Record the range of silent frames to be removed.
+ if i - silence_start <= self.max_sil_kept:
+ pos = rms_list[silence_start: i + 1].argmin() + silence_start
+ if silence_start == 0:
+ sil_tags.append((0, pos))
+ else:
+ sil_tags.append((pos, pos))
+ clip_start = pos
+ elif i - silence_start <= self.max_sil_kept * 2:
+ pos = rms_list[i - self.max_sil_kept: silence_start + self.max_sil_kept + 1].argmin()
+ pos += i - self.max_sil_kept
+ pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
+ pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
+ if silence_start == 0:
+ sil_tags.append((0, pos_r))
+ clip_start = pos_r
+ else:
+ sil_tags.append((min(pos_l, pos), max(pos_r, pos)))
+ clip_start = max(pos_r, pos)
+ else:
+ pos_l = rms_list[silence_start: silence_start + self.max_sil_kept + 1].argmin() + silence_start
+ pos_r = rms_list[i - self.max_sil_kept: i + 1].argmin() + i - self.max_sil_kept
+ if silence_start == 0:
+ sil_tags.append((0, pos_r))
+ else:
+ sil_tags.append((pos_l, pos_r))
+ clip_start = pos_r
+ silence_start = None
+ # Deal with trailing silence.
+ total_frames = rms_list.shape[0]
+ if silence_start is not None and total_frames - silence_start >= self.min_interval:
+ silence_end = min(total_frames, silence_start + self.max_sil_kept)
+ pos = rms_list[silence_start: silence_end + 1].argmin() + silence_start
+ sil_tags.append((pos, total_frames + 1))
+ # Apply and return slices.
+ if len(sil_tags) == 0:
+ return {"0": {"slice": False, "split_time": f"0,{len(waveform)}"}}
+ else:
+ chunks = []
+ # The first segment is not the beginning of the audio.
+ if sil_tags[0][0]:
+ chunks.append(
+ {"slice": False, "split_time": f"0,{min(waveform.shape[0], sil_tags[0][0] * self.hop_size)}"})
+ for i in range(0, len(sil_tags)):
+ # Mark audio segment. Skip the first segment.
+ if i:
+ chunks.append({"slice": False,
+ "split_time": f"{sil_tags[i - 1][1] * self.hop_size},{min(waveform.shape[0], sil_tags[i][0] * self.hop_size)}"})
+ # Mark all mute segments
+ chunks.append({"slice": True,
+ "split_time": f"{sil_tags[i][0] * self.hop_size},{min(waveform.shape[0], sil_tags[i][1] * self.hop_size)}"})
+ # The last segment is not the end.
+ if sil_tags[-1][1] * self.hop_size < len(waveform):
+ chunks.append({"slice": False, "split_time": f"{sil_tags[-1][1] * self.hop_size},{len(waveform)}"})
+ chunk_dict = {}
+ for i in range(len(chunks)):
+ chunk_dict[str(i)] = chunks[i]
+ return chunk_dict
+
+
+def cut(audio_path, db_thresh=-30, min_len=5000):
+ audio, sr = librosa.load(audio_path, sr=None)
+ slicer = Slicer(
+ sr=sr,
+ threshold=db_thresh,
+ min_length=min_len
+ )
+ chunks = slicer.slice(audio)
+ return chunks
+
+
+def chunks2audio(audio_path, chunks):
+ chunks = dict(chunks)
+ audio, sr = torchaudio.load(audio_path)
+ if len(audio.shape) == 2 and audio.shape[0] >= 2:  # torchaudio.load returns (channels, frames)
+ audio = torch.mean(audio, dim=0).unsqueeze(0)
+ audio = audio.cpu().numpy()[0]
+ result = []
+ for k, v in chunks.items():
+ tag = v["split_time"].split(",")
+ if tag[0] != tag[1]:
+ result.append((v["slice"], audio[int(tag[0]):int(tag[1])]))
+ return result, sr
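Review note on slicer.py: Slicer.__init__ converts a dB threshold into a linear amplitude and converts millisecond durations into counts of RMS frames, and chunks2audio later reads the "split_time" strings back into sample offsets. The sketch below mirrors those three conversions in isolation. All function names are hypothetical, and the sketch works on plain numbers and dicts rather than on audio.

```python
def db_to_amplitude(db):
    # -40 dB corresponds to a linear amplitude of 0.01.
    return 10 ** (db / 20.0)


def ms_to_frames(ms, sr, hop_ms=20):
    """Convert a duration in milliseconds to a number of hop-sized frames."""
    hop_samples = round(sr * hop_ms / 1000)
    return round(sr * ms / 1000 / hop_samples)


def parse_chunks(chunk_dict):
    """Turn the slicer's chunk dict into (is_silence, start, end) tuples,
    dropping zero-length spans as chunks2audio does."""
    spans = []
    for entry in chunk_dict.values():
        start, end = (int(v) for v in entry["split_time"].split(","))
        if start != end:
            spans.append((entry["slice"], start, end))
    return spans
```

With the defaults in the diff at 44.1 kHz, a 20 ms hop is 882 samples, so min_length=5000 ms becomes 250 frames and min_interval=300 ms becomes 15 frames.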
diff --git a/so-vits-svc/inference_main.py b/so-vits-svc/inference_main.py
new file mode 100644
index 0000000000000000000000000000000000000000..df11f499b1648755c923d530bdac359cc577a80b
--- /dev/null
+++ b/so-vits-svc/inference_main.py
@@ -0,0 +1,161 @@
+import io
+import logging
+import time
+from pathlib import Path
+
+import librosa
+import matplotlib.pyplot as plt
+import numpy as np
+import soundfile
+
+from inference import infer_tool
+from inference import slicer
+from inference.infer_tool import Svc
+
+logging.getLogger('numba').setLevel(logging.WARNING)
+chunks_dict = infer_tool.read_temp("inference/chunks_temp.json")
+
+
+
+def main():
+ import argparse
+
+ parser = argparse.ArgumentParser(description='sovits4 inference')
+
+ # Required
+ parser.add_argument('-m', '--model_path', type=str, default="logs/44k/G_0.pth",
+ help='Path to the model.')
+ parser.add_argument('-c', '--config_path', type=str, default="configs/config.json",
+ help='Path to the configuration file.')
+ parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=['nen'],
+ help='Target speaker name for conversion.')
+ parser.add_argument('-n', '--clean_names', type=str, nargs='+', default=["君の知らない物語-src.wav"],
+ help='A list of wav file names located in the raw folder.')
+ parser.add_argument('-t', '--trans', type=int, nargs='+', default=[0],
+ help='Pitch adjustment, supports positive and negative (semitone) values.')
+
+ # Optional
+ parser.add_argument('-a', '--auto_predict_f0', action='store_true', default=False,
+ help='Automatic pitch prediction for voice conversion. Do not enable this when converting songs as it can cause serious pitch issues.')
+ parser.add_argument('-cl', '--clip', type=float, default=0,
+ help='Forced slicing length in seconds. Set to 0 (default) to turn off.')
+ parser.add_argument('-lg', '--linear_gradient', type=float, default=0,
+ help='The cross-fade length of two audio slices in seconds. If the voice is discontinuous after forced slicing, adjust this value. Otherwise, keep the default of 0.')
+ parser.add_argument('-cm', '--cluster_model_path', type=str, default="logs/44k/kmeans_10000.pt",
+ help='Path to the clustering model. Fill in any value if clustering is not trained.')
+ parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=0,
+ help='Proportion of the clustering solution, range 0-1. Fill in 0 if the clustering model is not trained.')
+ parser.add_argument('-fmp', '--f0_mean_pooling', action='store_true', default=False,
+ help='Apply mean filter (pooling) to f0, which may improve some hoarse sounds. Enabling this option will reduce inference speed.')
+ parser.add_argument('-eh', '--enhance', action='store_true', default=False,
+ help='Whether to use the NSF_HIFIGAN enhancer. It can improve sound quality for models trained on little data, but it degrades well-trained models, so it is off by default.')
+
+ # generally keep default
+ parser.add_argument('-sd', '--slice_db', type=int, default=-40,
+ help='Loudness threshold in dB for automatic slicing. For noisy audio it can be set to -30.')
+ parser.add_argument('-d', '--device', type=str, default=None,
+ help='Device used for inference. None means auto selecting.')
+ parser.add_argument('-ns', '--noice_scale', type=float, default=0.4,
+ help='Affect pronunciation and sound quality.')
+ parser.add_argument('-p', '--pad_seconds', type=float, default=0.5,
+ help='Seconds of silence padded to each end of the input. For unknown reasons there may be abnormal noise at the beginning and end; padding a short silent segment removes it.')
+ parser.add_argument('-wf', '--wav_format', type=str, default='flac',
+ help='Output audio format.')
+ parser.add_argument('-lgr', '--linear_gradient_retain', type=float, default=0.75,
+ help='Proportion of the cross-fade length that is retained, range (0, 1]. After forced slicing, the beginning and end of each segment are discarded.')
+ parser.add_argument('-eak', '--enhancer_adaptive_key', type=int, default=0,
+ help='Adapt the enhancer to a higher pitch range. The unit is semitones, default 0.')
+ parser.add_argument('-ft', '--f0_filter_threshold', type=float, default=0.05,
+ help='F0 filtering threshold. Only valid when f0_mean_pooling is enabled. Values range from 0 to 1. Reducing this value reduces the probability of going out of tune, but makes the sound more muffled.')
+
+
+ args = parser.parse_args()
+
+ clean_names = args.clean_names
+ trans = args.trans
+ spk_list = args.spk_list
+ slice_db = args.slice_db
+ wav_format = args.wav_format
+ auto_predict_f0 = args.auto_predict_f0
+ cluster_infer_ratio = args.cluster_infer_ratio
+ noice_scale = args.noice_scale
+ pad_seconds = args.pad_seconds
+ clip = args.clip
+ lg = args.linear_gradient
+ lgr = args.linear_gradient_retain
+ F0_mean_pooling = args.f0_mean_pooling
+ enhance = args.enhance
+ enhancer_adaptive_key = args.enhancer_adaptive_key
+ cr_threshold = args.f0_filter_threshold
+
+ svc_model = Svc(args.model_path, args.config_path, args.device, args.cluster_model_path,enhance)
+ infer_tool.mkdir(["raw", "results"])
+
+ infer_tool.fill_a_to_b(trans, clean_names)
+ for clean_name, tran in zip(clean_names, trans):
+ raw_audio_path = f"raw/{clean_name}"
+ if "." not in raw_audio_path:
+ raw_audio_path += ".wav"
+ infer_tool.format_wav(raw_audio_path)
+ wav_path = Path(raw_audio_path).with_suffix('.wav')
+ chunks = slicer.cut(wav_path, db_thresh=slice_db)
+ audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
+ per_size = int(clip*audio_sr)
+ lg_size = int(args.linear_gradient*audio_sr)  # lg is rebound to a ramp array below, so read the CLI value here
+ lg_size_r = int(lg_size*lgr)
+ lg_size_c_l = (lg_size-lg_size_r)//2
+ lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
+ lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
+
+ for spk in spk_list:
+ audio = []
+ for (slice_tag, data) in audio_data:
+ print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
+
+ length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
+ if slice_tag:
+ print('skip empty segment')
+ _audio = np.zeros(length)
+ audio.extend(list(infer_tool.pad_array(_audio, length)))
+ continue
+ if per_size != 0:
+ datas = infer_tool.split_list_by_n(data, per_size,lg_size)
+ else:
+ datas = [data]
+ for k,dat in enumerate(datas):
+ per_length = int(np.ceil(len(dat) / audio_sr * svc_model.target_sample)) if clip!=0 else length
+ if clip!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
+ # pad both ends with silence
+ pad_len = int(audio_sr * pad_seconds)
+ dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
+ raw_path = io.BytesIO()
+ soundfile.write(raw_path, dat, audio_sr, format="wav")
+ raw_path.seek(0)
+ out_audio, out_sr = svc_model.infer(spk, tran, raw_path,
+ cluster_infer_ratio=cluster_infer_ratio,
+ auto_predict_f0=auto_predict_f0,
+ noice_scale=noice_scale,
+ F0_mean_pooling = F0_mean_pooling,
+ enhancer_adaptive_key = enhancer_adaptive_key,
+ cr_threshold = cr_threshold
+ )
+ _audio = out_audio.cpu().numpy()
+ pad_len = int(svc_model.target_sample * pad_seconds)
+ _audio = _audio[pad_len:-pad_len]
+ _audio = infer_tool.pad_array(_audio, per_length)
+ if lg_size!=0 and k!=0:
+ lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr != 1 else audio[-lg_size:]
+ lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr != 1 else _audio[0:lg_size]
+ lg_pre = lg1*(1-lg)+lg2*lg
+ audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr != 1 else audio[0:-lg_size]
+ audio.extend(lg_pre)
+ _audio = _audio[lg_size_c_l+lg_size_r:] if lgr != 1 else _audio[lg_size:]
+ audio.extend(list(_audio))
+ key = "auto" if auto_predict_f0 else f"{tran}key"
+ cluster_name = "" if cluster_infer_ratio == 0 else f"_{cluster_infer_ratio}"
+ res_path = f'./results/{clean_name}_{key}_{spk}{cluster_name}.{wav_format}'
+ soundfile.write(res_path, audio, svc_model.target_sample, format=wav_format)
+ svc_model.clear_empty()
+
+if __name__ == '__main__':
+ main()
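Review note on inference_main.py: when forced slicing is combined with --linear_gradient, the overlap between two clips is split into a retained middle region that is cross-faded plus a left and a right margin that are discarded. The sketch below repeats the integer arithmetic from the diff for one set of values so the sizes can be checked by hand. The function name overlap_layout is hypothetical; the variable names follow the diff.

```python
def overlap_layout(lg_seconds, sample_rate, retain):
    """Return (lg_size, lg_size_r, lg_size_c_l, lg_size_c_r) in samples."""
    lg_size = int(lg_seconds * sample_rate)        # whole overlap
    lg_size_r = int(lg_size * retain)              # cross-faded (retained) part
    lg_size_c_l = (lg_size - lg_size_r) // 2       # discarded left margin
    lg_size_c_r = lg_size - lg_size_r - lg_size_c_l  # discarded right margin
    return lg_size, lg_size_r, lg_size_c_l, lg_size_c_r


# One second of overlap at 44.1 kHz with the default retain ratio of 0.75.
layout = overlap_layout(1.0, 44100, 0.75)
```

The three parts always add up to lg_size, and when the discarded total is odd the extra sample goes to the right margin.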
diff --git a/so-vits-svc/logs/44k/put_pretrained_model_here b/so-vits-svc/logs/44k/put_pretrained_model_here
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/models.py b/so-vits-svc/models.py
new file mode 100644
index 0000000000000000000000000000000000000000..13278d680493970f5a670cf3fc955a6e9b7ab1d5
--- /dev/null
+++ b/so-vits-svc/models.py
@@ -0,0 +1,420 @@
+import copy
+import math
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+import modules.attentions as attentions
+import modules.commons as commons
+import modules.modules as modules
+
+from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
+from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
+
+import utils
+from modules.commons import init_weights, get_padding
+from vdecoder.hifigan.models import Generator
+from utils import f0_to_coarse
+
+class ResidualCouplingBlock(nn.Module):
+ def __init__(self,
+ channels,
+ hidden_channels,
+ kernel_size,
+ dilation_rate,
+ n_layers,
+ n_flows=4,
+ gin_channels=0):
+ super().__init__()
+ self.channels = channels
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.dilation_rate = dilation_rate
+ self.n_layers = n_layers
+ self.n_flows = n_flows
+ self.gin_channels = gin_channels
+
+ self.flows = nn.ModuleList()
+ for i in range(n_flows):
+ self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
+ self.flows.append(modules.Flip())
+
+ def forward(self, x, x_mask, g=None, reverse=False):
+ if not reverse:
+ for flow in self.flows:
+ x, _ = flow(x, x_mask, g=g, reverse=reverse)
+ else:
+ for flow in reversed(self.flows):
+ x = flow(x, x_mask, g=g, reverse=reverse)
+ return x
+
+
+class Encoder(nn.Module):
+ def __init__(self,
+ in_channels,
+ out_channels,
+ hidden_channels,
+ kernel_size,
+ dilation_rate,
+ n_layers,
+ gin_channels=0):
+ super().__init__()
+ self.in_channels = in_channels
+ self.out_channels = out_channels
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.dilation_rate = dilation_rate
+ self.n_layers = n_layers
+ self.gin_channels = gin_channels
+
+ self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
+ self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
+
+ def forward(self, x, x_lengths, g=None):
+ # print(x.shape,x_lengths.shape)
+ x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
+ x = self.pre(x) * x_mask
+ x = self.enc(x, x_mask, g=g)
+ stats = self.proj(x) * x_mask
+ m, logs = torch.split(stats, self.out_channels, dim=1)
+ z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
+ return z, m, logs, x_mask
+
+
+class TextEncoder(nn.Module):
+ def __init__(self,
+ out_channels,
+ hidden_channels,
+ kernel_size,
+ n_layers,
+ gin_channels=0,
+ filter_channels=None,
+ n_heads=None,
+ p_dropout=None):
+ super().__init__()
+ self.out_channels = out_channels
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.n_layers = n_layers
+ self.gin_channels = gin_channels
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
+ self.f0_emb = nn.Embedding(256, hidden_channels)
+
+ self.enc_ = attentions.Encoder(
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout)
+
+ def forward(self, x, x_mask, f0=None, noice_scale=1):
+ x = x + self.f0_emb(f0).transpose(1,2)
+ x = self.enc_(x * x_mask, x_mask)
+ stats = self.proj(x) * x_mask
+ m, logs = torch.split(stats, self.out_channels, dim=1)
+ z = (m + torch.randn_like(m) * torch.exp(logs) * noice_scale) * x_mask
+
+ return z, m, logs, x_mask
+
+
+
+class DiscriminatorP(torch.nn.Module):
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
+ super(DiscriminatorP, self).__init__()
+ self.period = period
+ self.use_spectral_norm = use_spectral_norm
+ norm_f = weight_norm if not use_spectral_norm else spectral_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(get_padding(kernel_size, 1), 0))),
+ ])
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
+
+ def forward(self, x):
+ fmap = []
+
+ # 1d to 2d
+ b, c, t = x.shape
+ if t % self.period != 0: # pad first
+ n_pad = self.period - (t % self.period)
+ x = F.pad(x, (0, n_pad), "reflect")
+ t = t + n_pad
+ x = x.view(b, c, t // self.period, self.period)
+
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class DiscriminatorS(torch.nn.Module):
+ def __init__(self, use_spectral_norm=False):
+ super(DiscriminatorS, self).__init__()
+ norm_f = weight_norm if not use_spectral_norm else spectral_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv1d(1, 16, 15, 1, padding=7)),
+ norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
+ norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
+ norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
+ ])
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
+
+ def forward(self, x):
+ fmap = []
+
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class MultiPeriodDiscriminator(torch.nn.Module):
+ def __init__(self, use_spectral_norm=False):
+ super(MultiPeriodDiscriminator, self).__init__()
+ periods = [2,3,5,7,11]
+
+ discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
+ discs = discs + [DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods]
+ self.discriminators = nn.ModuleList(discs)
+
+ def forward(self, y, y_hat):
+ y_d_rs = []
+ y_d_gs = []
+ fmap_rs = []
+ fmap_gs = []
+ for i, d in enumerate(self.discriminators):
+ y_d_r, fmap_r = d(y)
+ y_d_g, fmap_g = d(y_hat)
+ y_d_rs.append(y_d_r)
+ y_d_gs.append(y_d_g)
+ fmap_rs.append(fmap_r)
+ fmap_gs.append(fmap_g)
+
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+class SpeakerEncoder(torch.nn.Module):
+ def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden_size=256, model_embedding_size=256):
+ super(SpeakerEncoder, self).__init__()
+ self.lstm = nn.LSTM(mel_n_channels, model_hidden_size, model_num_layers, batch_first=True)
+ self.linear = nn.Linear(model_hidden_size, model_embedding_size)
+ self.relu = nn.ReLU()
+
+ def forward(self, mels):
+ self.lstm.flatten_parameters()
+ _, (hidden, _) = self.lstm(mels)
+ embeds_raw = self.relu(self.linear(hidden[-1]))
+ return embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
+
+ def compute_partial_slices(self, total_frames, partial_frames, partial_hop):
+ mel_slices = []
+ for i in range(0, total_frames-partial_frames, partial_hop):
+ mel_range = torch.arange(i, i+partial_frames)
+ mel_slices.append(mel_range)
+
+ return mel_slices
+
+ def embed_utterance(self, mel, partial_frames=128, partial_hop=64):
+ mel_len = mel.size(1)
+ last_mel = mel[:,-partial_frames:]
+
+ if mel_len > partial_frames:
+ mel_slices = self.compute_partial_slices(mel_len, partial_frames, partial_hop)
+ mels = list(mel[:,s] for s in mel_slices)
+ mels.append(last_mel)
+ mels = torch.stack(tuple(mels), 0).squeeze(1)
+
+ with torch.no_grad():
+ partial_embeds = self(mels)
+ embed = torch.mean(partial_embeds, axis=0).unsqueeze(0)
+ #embed = embed / torch.linalg.norm(embed, 2)
+ else:
+ with torch.no_grad():
+ embed = self(last_mel)
+
+ return embed
+
+class F0Decoder(nn.Module):
+ def __init__(self,
+ out_channels,
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout,
+ spk_channels=0):
+ super().__init__()
+ self.out_channels = out_channels
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.spk_channels = spk_channels
+
+ self.prenet = nn.Conv1d(hidden_channels, hidden_channels, 3, padding=1)
+ self.decoder = attentions.FFT(
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout)
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
+ self.f0_prenet = nn.Conv1d(1, hidden_channels , 3, padding=1)
+ self.cond = nn.Conv1d(spk_channels, hidden_channels, 1)
+
+ def forward(self, x, norm_f0, x_mask, spk_emb=None):
+ x = torch.detach(x)
+ if (spk_emb is not None):
+ x = x + self.cond(spk_emb)
+ x += self.f0_prenet(norm_f0)
+ x = self.prenet(x) * x_mask
+ x = self.decoder(x * x_mask, x_mask)
+ x = self.proj(x) * x_mask
+ return x
+
+
+class SynthesizerTrn(nn.Module):
+ """
+ Synthesizer for Training
+ """
+
+ def __init__(self,
+ spec_channels,
+ segment_size,
+ inter_channels,
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout,
+ resblock,
+ resblock_kernel_sizes,
+ resblock_dilation_sizes,
+ upsample_rates,
+ upsample_initial_channel,
+ upsample_kernel_sizes,
+ gin_channels,
+ ssl_dim,
+ n_speakers,
+ sampling_rate=44100,
+ **kwargs):
+
+ super().__init__()
+ self.spec_channels = spec_channels
+ self.inter_channels = inter_channels
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.resblock = resblock
+ self.resblock_kernel_sizes = resblock_kernel_sizes
+ self.resblock_dilation_sizes = resblock_dilation_sizes
+ self.upsample_rates = upsample_rates
+ self.upsample_initial_channel = upsample_initial_channel
+ self.upsample_kernel_sizes = upsample_kernel_sizes
+ self.segment_size = segment_size
+ self.gin_channels = gin_channels
+ self.ssl_dim = ssl_dim
+ self.emb_g = nn.Embedding(n_speakers, gin_channels)
+
+ self.pre = nn.Conv1d(ssl_dim, hidden_channels, kernel_size=5, padding=2)
+
+ self.enc_p = TextEncoder(
+ inter_channels,
+ hidden_channels,
+ filter_channels=filter_channels,
+ n_heads=n_heads,
+ n_layers=n_layers,
+ kernel_size=kernel_size,
+ p_dropout=p_dropout
+ )
+ hps = {
+ "sampling_rate": sampling_rate,
+ "inter_channels": inter_channels,
+ "resblock": resblock,
+ "resblock_kernel_sizes": resblock_kernel_sizes,
+ "resblock_dilation_sizes": resblock_dilation_sizes,
+ "upsample_rates": upsample_rates,
+ "upsample_initial_channel": upsample_initial_channel,
+ "upsample_kernel_sizes": upsample_kernel_sizes,
+ "gin_channels": gin_channels,
+ }
+ self.dec = Generator(h=hps)
+ self.enc_q = Encoder(spec_channels, inter_channels, hidden_channels, 5, 1, 16, gin_channels=gin_channels)
+ self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
+ self.f0_decoder = F0Decoder(
+ 1,
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout,
+ spk_channels=gin_channels
+ )
+ self.emb_uv = nn.Embedding(2, hidden_channels)
+
+ def forward(self, c, f0, uv, spec, g=None, c_lengths=None, spec_lengths=None):
+ g = self.emb_g(g).transpose(1,2)
+ # ssl prenet
+ x_mask = torch.unsqueeze(commons.sequence_mask(c_lengths, c.size(2)), 1).to(c.dtype)
+ x = self.pre(c) * x_mask + self.emb_uv(uv.long()).transpose(1,2)
+
+ # f0 predict
+ lf0 = 2595. * torch.log10(1. + f0.unsqueeze(1) / 700.) / 500
+ norm_lf0 = utils.normalize_f0(lf0, x_mask, uv)
+ pred_lf0 = self.f0_decoder(x, norm_lf0, x_mask, spk_emb=g)
+
+ # encoder
+ z_ptemp, m_p, logs_p, _ = self.enc_p(x, x_mask, f0=f0_to_coarse(f0))
+ z, m_q, logs_q, spec_mask = self.enc_q(spec, spec_lengths, g=g)
+
+ # flow
+ z_p = self.flow(z, spec_mask, g=g)
+ z_slice, pitch_slice, ids_slice = commons.rand_slice_segments_with_pitch(z, f0, spec_lengths, self.segment_size)
+
+ # nsf decoder
+ o = self.dec(z_slice, g=g, f0=pitch_slice)
+
+ return o, ids_slice, spec_mask, (z, z_p, m_p, logs_p, m_q, logs_q), pred_lf0, norm_lf0, lf0
+
+ def infer(self, c, f0, uv, g=None, noice_scale=0.35, predict_f0=False):
+ c_lengths = (torch.ones(c.size(0)) * c.size(-1)).to(c.device)
+ g = self.emb_g(g).transpose(1,2)
+ x_mask = torch.unsqueeze(commons.sequence_mask(c_lengths, c.size(2)), 1).to(c.dtype)
+ x = self.pre(c) * x_mask + self.emb_uv(uv.long()).transpose(1,2)
+
+ if predict_f0:
+ lf0 = 2595. * torch.log10(1. + f0.unsqueeze(1) / 700.) / 500
+ norm_lf0 = utils.normalize_f0(lf0, x_mask, uv, random_scale=False)
+ pred_lf0 = self.f0_decoder(x, norm_lf0, x_mask, spk_emb=g)
+ f0 = (700 * (torch.pow(10, pred_lf0 * 500 / 2595) - 1)).squeeze(1)
+
+ z_p, m_p, logs_p, c_mask = self.enc_p(x, x_mask, f0=f0_to_coarse(f0), noice_scale=noice_scale)
+ z = self.flow(z_p, c_mask, g=g, reverse=True)
+ o = self.dec(z * c_mask, g=g, f0=f0)
+ return o
diff --git a/so-vits-svc/modules/__init__.py b/so-vits-svc/modules/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/modules/attentions.py b/so-vits-svc/modules/attentions.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9c11ca4a3acb86bf1abc04d9dcfa82a4ed4061f
--- /dev/null
+++ b/so-vits-svc/modules/attentions.py
@@ -0,0 +1,349 @@
+import copy
+import math
+import numpy as np
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+import modules.commons as commons
+import modules.modules as modules
+from modules.modules import LayerNorm
+
+
+class FFT(nn.Module):
+ def __init__(self, hidden_channels, filter_channels, n_heads, n_layers=1, kernel_size=1, p_dropout=0.,
+ proximal_bias=False, proximal_init=True, **kwargs):
+ super().__init__()
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.proximal_bias = proximal_bias
+ self.proximal_init = proximal_init
+
+ self.drop = nn.Dropout(p_dropout)
+ self.self_attn_layers = nn.ModuleList()
+ self.norm_layers_0 = nn.ModuleList()
+ self.ffn_layers = nn.ModuleList()
+ self.norm_layers_1 = nn.ModuleList()
+ for i in range(self.n_layers):
+ self.self_attn_layers.append(
+ MultiHeadAttention(hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout, proximal_bias=proximal_bias,
+ proximal_init=proximal_init))
+ self.norm_layers_0.append(LayerNorm(hidden_channels))
+ self.ffn_layers.append(
+ FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout, causal=True))
+ self.norm_layers_1.append(LayerNorm(hidden_channels))
+
+ def forward(self, x, x_mask):
+ """
+ x: input sequence, [b, hidden_channels, t]
+ x_mask: padding mask, [b, 1, t]
+ """
+ self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to(device=x.device, dtype=x.dtype)
+ x = x * x_mask
+ for i in range(self.n_layers):
+ y = self.self_attn_layers[i](x, x, self_attn_mask)
+ y = self.drop(y)
+ x = self.norm_layers_0[i](x + y)
+
+ y = self.ffn_layers[i](x, x_mask)
+ y = self.drop(y)
+ x = self.norm_layers_1[i](x + y)
+ x = x * x_mask
+ return x
+
+
+class Encoder(nn.Module):
+ def __init__(self, hidden_channels, filter_channels, n_heads, n_layers, kernel_size=1, p_dropout=0., window_size=4, **kwargs):
+ super().__init__()
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.window_size = window_size
+
+ self.drop = nn.Dropout(p_dropout)
+ self.attn_layers = nn.ModuleList()
+ self.norm_layers_1 = nn.ModuleList()
+ self.ffn_layers = nn.ModuleList()
+ self.norm_layers_2 = nn.ModuleList()
+ for i in range(self.n_layers):
+ self.attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout, window_size=window_size))
+ self.norm_layers_1.append(LayerNorm(hidden_channels))
+ self.ffn_layers.append(FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout))
+ self.norm_layers_2.append(LayerNorm(hidden_channels))
+
+ def forward(self, x, x_mask):
+ attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+ x = x * x_mask
+ for i in range(self.n_layers):
+ y = self.attn_layers[i](x, x, attn_mask)
+ y = self.drop(y)
+ x = self.norm_layers_1[i](x + y)
+
+ y = self.ffn_layers[i](x, x_mask)
+ y = self.drop(y)
+ x = self.norm_layers_2[i](x + y)
+ x = x * x_mask
+ return x
+
+
+class Decoder(nn.Module):
+ def __init__(self, hidden_channels, filter_channels, n_heads, n_layers, kernel_size=1, p_dropout=0., proximal_bias=False, proximal_init=True, **kwargs):
+ super().__init__()
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.proximal_bias = proximal_bias
+ self.proximal_init = proximal_init
+
+ self.drop = nn.Dropout(p_dropout)
+ self.self_attn_layers = nn.ModuleList()
+ self.norm_layers_0 = nn.ModuleList()
+ self.encdec_attn_layers = nn.ModuleList()
+ self.norm_layers_1 = nn.ModuleList()
+ self.ffn_layers = nn.ModuleList()
+ self.norm_layers_2 = nn.ModuleList()
+ for i in range(self.n_layers):
+ self.self_attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout, proximal_bias=proximal_bias, proximal_init=proximal_init))
+ self.norm_layers_0.append(LayerNorm(hidden_channels))
+ self.encdec_attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout))
+ self.norm_layers_1.append(LayerNorm(hidden_channels))
+ self.ffn_layers.append(FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout, causal=True))
+ self.norm_layers_2.append(LayerNorm(hidden_channels))
+
+ def forward(self, x, x_mask, h, h_mask):
+ """
+ x: decoder input
+ h: encoder output
+ """
+ self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to(device=x.device, dtype=x.dtype)
+ encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+ x = x * x_mask
+ for i in range(self.n_layers):
+ y = self.self_attn_layers[i](x, x, self_attn_mask)
+ y = self.drop(y)
+ x = self.norm_layers_0[i](x + y)
+
+ y = self.encdec_attn_layers[i](x, h, encdec_attn_mask)
+ y = self.drop(y)
+ x = self.norm_layers_1[i](x + y)
+
+ y = self.ffn_layers[i](x, x_mask)
+ y = self.drop(y)
+ x = self.norm_layers_2[i](x + y)
+ x = x * x_mask
+ return x
+
+
+class MultiHeadAttention(nn.Module):
+ def __init__(self, channels, out_channels, n_heads, p_dropout=0., window_size=None, heads_share=True, block_length=None, proximal_bias=False, proximal_init=False):
+ super().__init__()
+ assert channels % n_heads == 0
+
+ self.channels = channels
+ self.out_channels = out_channels
+ self.n_heads = n_heads
+ self.p_dropout = p_dropout
+ self.window_size = window_size
+ self.heads_share = heads_share
+ self.block_length = block_length
+ self.proximal_bias = proximal_bias
+ self.proximal_init = proximal_init
+ self.attn = None
+
+ self.k_channels = channels // n_heads
+ self.conv_q = nn.Conv1d(channels, channels, 1)
+ self.conv_k = nn.Conv1d(channels, channels, 1)
+ self.conv_v = nn.Conv1d(channels, channels, 1)
+ self.conv_o = nn.Conv1d(channels, out_channels, 1)
+ self.drop = nn.Dropout(p_dropout)
+
+ if window_size is not None:
+ n_heads_rel = 1 if heads_share else n_heads
+ rel_stddev = self.k_channels**-0.5
+ self.emb_rel_k = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
+ self.emb_rel_v = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
+
+ nn.init.xavier_uniform_(self.conv_q.weight)
+ nn.init.xavier_uniform_(self.conv_k.weight)
+ nn.init.xavier_uniform_(self.conv_v.weight)
+ if proximal_init:
+ with torch.no_grad():
+ self.conv_k.weight.copy_(self.conv_q.weight)
+ self.conv_k.bias.copy_(self.conv_q.bias)
+
+ def forward(self, x, c, attn_mask=None):
+ q = self.conv_q(x)
+ k = self.conv_k(c)
+ v = self.conv_v(c)
+
+ x, self.attn = self.attention(q, k, v, mask=attn_mask)
+
+ x = self.conv_o(x)
+ return x
+
+ def attention(self, query, key, value, mask=None):
+ # reshape [b, d, t] -> [b, n_h, t, d_k]
+ b, d, t_s, t_t = (*key.size(), query.size(2))
+ query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
+ key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+ value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+
+ scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1))
+ if self.window_size is not None:
+ assert t_s == t_t, "Relative attention is only available for self-attention."
+ key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
+ rel_logits = self._matmul_with_relative_keys(query /math.sqrt(self.k_channels), key_relative_embeddings)
+ scores_local = self._relative_position_to_absolute_position(rel_logits)
+ scores = scores + scores_local
+ if self.proximal_bias:
+ assert t_s == t_t, "Proximal bias is only available for self-attention."
+ scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype)
+ if mask is not None:
+ scores = scores.masked_fill(mask == 0, -1e4)
+ if self.block_length is not None:
+ assert t_s == t_t, "Local attention is only available for self-attention."
+ block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
+ scores = scores.masked_fill(block_mask == 0, -1e4)
+ p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s]
+ p_attn = self.drop(p_attn)
+ output = torch.matmul(p_attn, value)
+ if self.window_size is not None:
+ relative_weights = self._absolute_position_to_relative_position(p_attn)
+ value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s)
+ output = output + self._matmul_with_relative_values(relative_weights, value_relative_embeddings)
+ output = output.transpose(2, 3).contiguous().view(b, d, t_t) # [b, n_h, t_t, d_k] -> [b, d, t_t]
+ return output, p_attn
+
+ def _matmul_with_relative_values(self, x, y):
+ """
+ x: [b, h, l, m]
+ y: [h or 1, m, d]
+ ret: [b, h, l, d]
+ """
+ ret = torch.matmul(x, y.unsqueeze(0))
+ return ret
+
+ def _matmul_with_relative_keys(self, x, y):
+ """
+ x: [b, h, l, d]
+ y: [h or 1, m, d]
+ ret: [b, h, l, m]
+ """
+ ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
+ return ret
+
+ def _get_relative_embeddings(self, relative_embeddings, length):
+ max_relative_position = 2 * self.window_size + 1
+ # Pad first before slice to avoid using cond ops.
+ pad_length = max(length - (self.window_size + 1), 0)
+ slice_start_position = max((self.window_size + 1) - length, 0)
+ slice_end_position = slice_start_position + 2 * length - 1
+ if pad_length > 0:
+ padded_relative_embeddings = F.pad(
+ relative_embeddings,
+ commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]))
+ else:
+ padded_relative_embeddings = relative_embeddings
+ used_relative_embeddings = padded_relative_embeddings[:,slice_start_position:slice_end_position]
+ return used_relative_embeddings
+
+ def _relative_position_to_absolute_position(self, x):
+ """
+ x: [b, h, l, 2*l-1]
+ ret: [b, h, l, l]
+ """
+ batch, heads, length, _ = x.size()
+ # Concat columns of pad to shift from relative to absolute indexing.
+ x = F.pad(x, commons.convert_pad_shape([[0,0],[0,0],[0,0],[0,1]]))
+
+ # Concat extra elements so as to add up to shape (len+1, 2*len-1).
+ x_flat = x.view([batch, heads, length * 2 * length])
+ x_flat = F.pad(x_flat, commons.convert_pad_shape([[0,0],[0,0],[0,length-1]]))
+
+ # Reshape and slice out the padded elements.
+ x_final = x_flat.view([batch, heads, length+1, 2*length-1])[:, :, :length, length-1:]
+ return x_final
+
+ def _absolute_position_to_relative_position(self, x):
+ """
+ x: [b, h, l, l]
+ ret: [b, h, l, 2*l-1]
+ """
+ batch, heads, length, _ = x.size()
+ # pad along column
+ x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length-1]]))
+ x_flat = x.view([batch, heads, length**2 + length*(length -1)])
+ # add 0's in the beginning that will skew the elements after reshape
+ x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
+ x_final = x_flat.view([batch, heads, length, 2*length])[:,:,:,1:]
+ return x_final
+
+ def _attention_bias_proximal(self, length):
+ """Bias for self-attention to encourage attention to close positions.
+ Args:
+ length: an integer scalar.
+ Returns:
+ a Tensor with shape [1, 1, length, length]
+ """
+ r = torch.arange(length, dtype=torch.float32)
+ diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
+ return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
+
+
+class FFN(nn.Module):
+ def __init__(self, in_channels, out_channels, filter_channels, kernel_size, p_dropout=0., activation=None, causal=False):
+ super().__init__()
+ self.in_channels = in_channels
+ self.out_channels = out_channels
+ self.filter_channels = filter_channels
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.activation = activation
+ self.causal = causal
+
+ if causal:
+ self.padding = self._causal_padding
+ else:
+ self.padding = self._same_padding
+
+ self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size)
+ self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size)
+ self.drop = nn.Dropout(p_dropout)
+
+ def forward(self, x, x_mask):
+ x = self.conv_1(self.padding(x * x_mask))
+ if self.activation == "gelu":
+ x = x * torch.sigmoid(1.702 * x)
+ else:
+ x = torch.relu(x)
+ x = self.drop(x)
+ x = self.conv_2(self.padding(x * x_mask))
+ return x * x_mask
+
+ def _causal_padding(self, x):
+ if self.kernel_size == 1:
+ return x
+ pad_l = self.kernel_size - 1
+ pad_r = 0
+ padding = [[0, 0], [0, 0], [pad_l, pad_r]]
+ x = F.pad(x, commons.convert_pad_shape(padding))
+ return x
+
+ def _same_padding(self, x):
+ if self.kernel_size == 1:
+ return x
+ pad_l = (self.kernel_size - 1) // 2
+ pad_r = self.kernel_size // 2
+ padding = [[0, 0], [0, 0], [pad_l, pad_r]]
+ x = F.pad(x, commons.convert_pad_shape(padding))
+ return x
diff --git a/so-vits-svc/modules/commons.py b/so-vits-svc/modules/commons.py
new file mode 100644
index 0000000000000000000000000000000000000000..074888006392e956ce204d8368362dbb2cd4e304
--- /dev/null
+++ b/so-vits-svc/modules/commons.py
@@ -0,0 +1,188 @@
+import math
+import numpy as np
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+def slice_pitch_segments(x, ids_str, segment_size=4):
+ ret = torch.zeros_like(x[:, :segment_size])
+ for i in range(x.size(0)):
+ idx_str = ids_str[i]
+ idx_end = idx_str + segment_size
+ ret[i] = x[i, idx_str:idx_end]
+ return ret
+
+def rand_slice_segments_with_pitch(x, pitch, x_lengths=None, segment_size=4):
+ b, d, t = x.size()
+ if x_lengths is None:
+ x_lengths = t
+ ids_str_max = x_lengths - segment_size + 1
+ ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long)
+ ret = slice_segments(x, ids_str, segment_size)
+ ret_pitch = slice_pitch_segments(pitch, ids_str, segment_size)
+ return ret, ret_pitch, ids_str
+
+def init_weights(m, mean=0.0, std=0.01):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ m.weight.data.normal_(mean, std)
+
+
+def get_padding(kernel_size, dilation=1):
+ return int((kernel_size*dilation - dilation)/2)
+
+
+def convert_pad_shape(pad_shape):
+ l = pad_shape[::-1]
+ pad_shape = [item for sublist in l for item in sublist]
+ return pad_shape
+
+
+def intersperse(lst, item):
+ result = [item] * (len(lst) * 2 + 1)
+ result[1::2] = lst
+ return result
+
+
+def kl_divergence(m_p, logs_p, m_q, logs_q):
+ """KL(P||Q)"""
+ kl = (logs_q - logs_p) - 0.5
+ kl += 0.5 * (torch.exp(2. * logs_p) + ((m_p - m_q)**2)) * torch.exp(-2. * logs_q)
+ return kl
+
+
+def rand_gumbel(shape):
+ """Sample from the Gumbel distribution, protect from overflows."""
+ uniform_samples = torch.rand(shape) * 0.99998 + 0.00001
+ return -torch.log(-torch.log(uniform_samples))
+
+
+def rand_gumbel_like(x):
+ g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device)
+ return g
+
+
+def slice_segments(x, ids_str, segment_size=4):
+ ret = torch.zeros_like(x[:, :, :segment_size])
+ for i in range(x.size(0)):
+ idx_str = ids_str[i]
+ idx_end = idx_str + segment_size
+ ret[i] = x[i, :, idx_str:idx_end]
+ return ret
+
+
+def rand_slice_segments(x, x_lengths=None, segment_size=4):
+ b, d, t = x.size()
+ if x_lengths is None:
+ x_lengths = t
+ ids_str_max = x_lengths - segment_size + 1
+ ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long)
+ ret = slice_segments(x, ids_str, segment_size)
+ return ret, ids_str
+
+
+def rand_spec_segments(x, x_lengths=None, segment_size=4):
+ b, d, t = x.size()
+ if x_lengths is None:
+ x_lengths = t
+ ids_str_max = x_lengths - segment_size
+ ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long)
+ ret = slice_segments(x, ids_str, segment_size)
+ return ret, ids_str
+
+
+def get_timing_signal_1d(
+ length, channels, min_timescale=1.0, max_timescale=1.0e4):
+ position = torch.arange(length, dtype=torch.float)
+ num_timescales = channels // 2
+ log_timescale_increment = (
+ math.log(float(max_timescale) / float(min_timescale)) /
+ (num_timescales - 1))
+ inv_timescales = min_timescale * torch.exp(
+ torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment)
+ scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1)
+ signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0)
+ signal = F.pad(signal, [0, 0, 0, channels % 2])
+ signal = signal.view(1, channels, length)
+ return signal
+
+
+def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
+ b, channels, length = x.size()
+ signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
+ return x + signal.to(dtype=x.dtype, device=x.device)
+
+
+def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1):
+ b, channels, length = x.size()
+ signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
+ return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis)
+
+
+def subsequent_mask(length):
+ mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0)
+ return mask
+
+
+@torch.jit.script
+def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
+ n_channels_int = n_channels[0]
+ in_act = input_a + input_b
+ t_act = torch.tanh(in_act[:, :n_channels_int, :])
+ s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
+ acts = t_act * s_act
+ return acts
+
+
+def convert_pad_shape(pad_shape):
+ l = pad_shape[::-1]
+ pad_shape = [item for sublist in l for item in sublist]
+ return pad_shape
+
+
+def shift_1d(x):
+ x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
+ return x
+
+
+def sequence_mask(length, max_length=None):
+ if max_length is None:
+ max_length = length.max()
+ x = torch.arange(max_length, dtype=length.dtype, device=length.device)
+ return x.unsqueeze(0) < length.unsqueeze(1)
+
+
+def generate_path(duration, mask):
+ """
+ duration: [b, 1, t_x]
+ mask: [b, 1, t_y, t_x]
+ """
+ device = duration.device
+
+ b, _, t_y, t_x = mask.shape
+ cum_duration = torch.cumsum(duration, -1)
+
+ cum_duration_flat = cum_duration.view(b * t_x)
+ path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
+ path = path.view(b, t_x, t_y)
+ path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
+ path = path.unsqueeze(1).transpose(2,3) * mask
+ return path
+
+
+def clip_grad_value_(parameters, clip_value, norm_type=2):
+ if isinstance(parameters, torch.Tensor):
+ parameters = [parameters]
+ parameters = list(filter(lambda p: p.grad is not None, parameters))
+ norm_type = float(norm_type)
+ if clip_value is not None:
+ clip_value = float(clip_value)
+
+ total_norm = 0
+ for p in parameters:
+ param_norm = p.grad.data.norm(norm_type)
+ total_norm += param_norm.item() ** norm_type
+ if clip_value is not None:
+ p.grad.data.clamp_(min=-clip_value, max=clip_value)
+ total_norm = total_norm ** (1. / norm_type)
+ return total_norm
diff --git a/so-vits-svc/modules/crepe.py b/so-vits-svc/modules/crepe.py
new file mode 100644
index 0000000000000000000000000000000000000000..b58c1680d02fef54497c36bd47a36776cc7f6af5
--- /dev/null
+++ b/so-vits-svc/modules/crepe.py
@@ -0,0 +1,331 @@
+from typing import Optional, Union
+try:
+ from typing import Literal
+except ImportError:
+ from typing_extensions import Literal
+import numpy as np
+import torch
+import torchcrepe
+from torch import nn
+from torch.nn import functional as F
+import scipy
+
+#from:https://github.com/fishaudio/fish-diffusion
+
+def repeat_expand(
+ content: Union[torch.Tensor, np.ndarray], target_len: int, mode: str = "nearest"
+):
+ """Repeat content to target length.
+ This is a wrapper of torch.nn.functional.interpolate.
+
+ Args:
+ content (torch.Tensor): tensor
+ target_len (int): target length
+ mode (str, optional): interpolation mode. Defaults to "nearest".
+
+ Returns:
+ torch.Tensor: tensor
+ """
+
+ ndim = content.ndim
+
+ if content.ndim == 1:
+ content = content[None, None]
+ elif content.ndim == 2:
+ content = content[None]
+
+ assert content.ndim == 3
+
+ is_np = isinstance(content, np.ndarray)
+ if is_np:
+ content = torch.from_numpy(content)
+
+ results = torch.nn.functional.interpolate(content, size=target_len, mode=mode)
+
+ if is_np:
+ results = results.numpy()
+
+ if ndim == 1:
+ return results[0, 0]
+ elif ndim == 2:
+ return results[0]
+
+
+class BasePitchExtractor:
+ def __init__(
+ self,
+ hop_length: int = 512,
+ f0_min: float = 50.0,
+ f0_max: float = 1100.0,
+ keep_zeros: bool = True,
+ ):
+ """Base pitch extractor.
+
+ Args:
+ hop_length (int, optional): Hop length. Defaults to 512.
+ f0_min (float, optional): Minimum f0. Defaults to 50.0.
+ f0_max (float, optional): Maximum f0. Defaults to 1100.0.
+ keep_zeros (bool, optional): Whether to keep zeros in pitch. Defaults to True.
+ """
+
+ self.hop_length = hop_length
+ self.f0_min = f0_min
+ self.f0_max = f0_max
+ self.keep_zeros = keep_zeros
+
+ def __call__(self, x, sampling_rate=44100, pad_to=None):
+ raise NotImplementedError("BasePitchExtractor is not callable.")
+
+ def post_process(self, x, sampling_rate, f0, pad_to):
+ if isinstance(f0, np.ndarray):
+ f0 = torch.from_numpy(f0).float().to(x.device)
+
+ if pad_to is None:
+ return f0
+
+ f0 = repeat_expand(f0, pad_to)
+
+ if self.keep_zeros:
+ return f0
+
+ vuv_vector = torch.zeros_like(f0)
+ vuv_vector[f0 > 0.0] = 1.0
+ vuv_vector[f0 <= 0.0] = 0.0
+
+ # Remove 0 frequency and apply linear interpolation
+ nzindex = torch.nonzero(f0).squeeze()
+ f0 = torch.index_select(f0, dim=0, index=nzindex).cpu().numpy()
+ time_org = self.hop_length / sampling_rate * nzindex.cpu().numpy()
+ time_frame = np.arange(pad_to) * self.hop_length / sampling_rate
+
+ if f0.shape[0] <= 0:
+ return torch.zeros(pad_to, dtype=torch.float, device=x.device),torch.zeros(pad_to, dtype=torch.float, device=x.device)
+
+ if f0.shape[0] == 1:
+ return torch.ones(pad_to, dtype=torch.float, device=x.device) * f0[0],torch.ones(pad_to, dtype=torch.float, device=x.device)
+
+ # Probably can be rewritten with torch?
+ f0 = np.interp(time_frame, time_org, f0, left=f0[0], right=f0[-1])
+ vuv_vector = vuv_vector.cpu().numpy()
+ vuv_vector = np.ceil(scipy.ndimage.zoom(vuv_vector,pad_to/len(vuv_vector),order = 0))
+
+ return f0,vuv_vector
+
+
+class MaskedAvgPool1d(nn.Module):
+ def __init__(
+ self, kernel_size: int, stride: Optional[int] = None, padding: Optional[int] = 0
+ ):
+ """An implementation of mean pooling that supports masked values.
+
+ Args:
+ kernel_size (int): The size of the mean pooling window.
+ stride (int, optional): The stride of the mean pooling window. Defaults to None.
+ padding (int, optional): The padding of the mean pooling window. Defaults to 0.
+ """
+
+ super(MaskedAvgPool1d, self).__init__()
+ self.kernel_size = kernel_size
+ self.stride = stride or kernel_size
+ self.padding = padding
+
+ def forward(self, x, mask=None):
+ ndim = x.dim()
+ if ndim == 2:
+ x = x.unsqueeze(1)
+
+ assert (
+ x.dim() == 3
+ ), "Input tensor must have 2 or 3 dimensions (batch_size, channels, width)"
+
+ # Apply the mask by setting masked elements to zero, or make NaNs zero
+ if mask is None:
+ mask = ~torch.isnan(x)
+
+ # Ensure mask has the same shape as the input tensor
+ assert x.shape == mask.shape, "Input tensor and mask must have the same shape"
+
+ masked_x = torch.where(mask, x, torch.zeros_like(x))
+ # Create a ones kernel with the same number of channels as the input tensor
+ ones_kernel = torch.ones(x.size(1), 1, self.kernel_size, device=x.device)
+
+ # Perform sum pooling
+ sum_pooled = nn.functional.conv1d(
+ masked_x,
+ ones_kernel,
+ stride=self.stride,
+ padding=self.padding,
+ groups=x.size(1),
+ )
+
+ # Count the non-masked (valid) elements in each pooling window
+ valid_count = nn.functional.conv1d(
+ mask.float(),
+ ones_kernel,
+ stride=self.stride,
+ padding=self.padding,
+ groups=x.size(1),
+ )
+ valid_count = valid_count.clamp(min=1) # Avoid division by zero
+
+ # Perform masked average pooling
+ avg_pooled = sum_pooled / valid_count
+
+ # Fill zero values with NaNs
+ avg_pooled[avg_pooled == 0] = float("nan")
+
+ if ndim == 2:
+ return avg_pooled.squeeze(1)
+
+ return avg_pooled
+
+
+class MaskedMedianPool1d(nn.Module):
+ def __init__(
+ self, kernel_size: int, stride: Optional[int] = None, padding: Optional[int] = 0
+ ):
+ """An implementation of median pooling that supports masked values.
+
+ This implementation is inspired by the median pooling implementation in
+ https://gist.github.com/rwightman/f2d3849281624be7c0f11c85c87c1598
+
+ Args:
+ kernel_size (int): The size of the median pooling window.
+ stride (int, optional): The stride of the median pooling window. Defaults to None.
+ padding (int, optional): The padding of the median pooling window. Defaults to 0.
+ """
+
+ super(MaskedMedianPool1d, self).__init__()
+ self.kernel_size = kernel_size
+ self.stride = stride or kernel_size
+ self.padding = padding
+
+ def forward(self, x, mask=None):
+ ndim = x.dim()
+ if ndim == 2:
+ x = x.unsqueeze(1)
+
+ assert (
+ x.dim() == 3
+ ), "Input tensor must have 2 or 3 dimensions (batch_size, channels, width)"
+
+ if mask is None:
+ mask = ~torch.isnan(x)
+
+ assert x.shape == mask.shape, "Input tensor and mask must have the same shape"
+
+ masked_x = torch.where(mask, x, torch.zeros_like(x))
+
+ x = F.pad(masked_x, (self.padding, self.padding), mode="reflect")
+ mask = F.pad(
+ mask.float(), (self.padding, self.padding), mode="constant", value=0
+ )
+
+ x = x.unfold(2, self.kernel_size, self.stride)
+ mask = mask.unfold(2, self.kernel_size, self.stride)
+
+ x = x.contiguous().view(x.size()[:3] + (-1,))
+ mask = mask.contiguous().view(mask.size()[:3] + (-1,)).to(x.device)
+
+ # Combine the mask with the input tensor
+ #x_masked = torch.where(mask.bool(), x, torch.fill_(torch.zeros_like(x),float("inf")))
+ x_masked = torch.where(mask.bool(), x, torch.FloatTensor([float("inf")]).to(x.device))
+
+ # Sort the masked tensor along the last dimension
+ x_sorted, _ = torch.sort(x_masked, dim=-1)
+
+ # Compute the count of non-masked (valid) values
+ valid_count = mask.sum(dim=-1)
+
+ # Calculate the index of the median value for each pooling window
+ median_idx = (torch.div((valid_count - 1), 2, rounding_mode='trunc')).clamp(min=0)
+
+ # Gather the median values using the calculated indices
+ median_pooled = x_sorted.gather(-1, median_idx.unsqueeze(-1).long()).squeeze(-1)
+
+ # Fill infinite values with NaNs
+ median_pooled[torch.isinf(median_pooled)] = float("nan")
+
+ if ndim == 2:
+ return median_pooled.squeeze(1)
+
+ return median_pooled
+
+
+class CrepePitchExtractor(BasePitchExtractor):
+ def __init__(
+ self,
+ hop_length: int = 512,
+ f0_min: float = 50.0,
+ f0_max: float = 1100.0,
+ threshold: float = 0.05,
+ keep_zeros: bool = False,
+ device = None,
+ model: Literal["full", "tiny"] = "full",
+ use_fast_filters: bool = True,
+ ):
+ super().__init__(hop_length, f0_min, f0_max, keep_zeros)
+
+ self.threshold = threshold
+ self.model = model
+ self.use_fast_filters = use_fast_filters
+ self.hop_length = hop_length
+ if device is None:
+ self.dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ else:
+ self.dev = torch.device(device)
+ if self.use_fast_filters:
+ self.median_filter = MaskedMedianPool1d(3, 1, 1).to(self.dev)
+ self.mean_filter = MaskedAvgPool1d(3, 1, 1).to(self.dev)
+
+ def __call__(self, x, sampling_rate=44100, pad_to=None):
+ """Extract pitch using crepe.
+
+
+ Args:
+ x (torch.Tensor): Audio signal, shape (1, T).
+ sampling_rate (int, optional): Sampling rate. Defaults to 44100.
+ pad_to (int, optional): Pad to length. Defaults to None.
+
+ Returns:
+ torch.Tensor: Pitch, shape (T // hop_length,).
+ """
+
+ assert x.ndim == 2, f"Expected 2D tensor, got {x.ndim}D tensor."
+ assert x.shape[0] == 1, f"Expected 1 channel, got {x.shape[0]} channels."
+
+ x = x.to(self.dev)
+ f0, pd = torchcrepe.predict(
+ x,
+ sampling_rate,
+ self.hop_length,
+ self.f0_min,
+ self.f0_max,
+ pad=True,
+ model=self.model,
+ batch_size=1024,
+ device=x.device,
+ return_periodicity=True,
+ )
+
+ # Filter, remove silence, and set the uv threshold; refer to the original repository's README
+ if self.use_fast_filters:
+ pd = self.median_filter(pd)
+ else:
+ pd = torchcrepe.filter.median(pd, 3)
+
+ pd = torchcrepe.threshold.Silence(-60.0)(pd, x, sampling_rate, self.hop_length)
+ f0 = torchcrepe.threshold.At(self.threshold)(f0, pd)
+
+ if self.use_fast_filters:
+ f0 = self.mean_filter(f0)
+ else:
+ f0 = torchcrepe.filter.mean(f0, 3)
+
+ f0 = torch.where(torch.isnan(f0), torch.full_like(f0, 0), f0)[0]
+
+ if torch.all(f0 == 0):
+ rtn = f0.cpu().numpy() if pad_to is None else np.zeros(pad_to)
+ return rtn,rtn
+
+ return self.post_process(x, sampling_rate, f0, pad_to)
diff --git a/so-vits-svc/modules/enhancer.py b/so-vits-svc/modules/enhancer.py
new file mode 100644
index 0000000000000000000000000000000000000000..37676311f7d8dc4ddc2a5244dedc27b2437e04f5
--- /dev/null
+++ b/so-vits-svc/modules/enhancer.py
@@ -0,0 +1,105 @@
+import numpy as np
+import torch
+import torch.nn.functional as F
+from vdecoder.nsf_hifigan.nvSTFT import STFT
+from vdecoder.nsf_hifigan.models import load_model
+from torchaudio.transforms import Resample
+
+class Enhancer:
+ def __init__(self, enhancer_type, enhancer_ckpt, device=None):
+ if device is None:
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ self.device = device
+
+ if enhancer_type == 'nsf-hifigan':
+ self.enhancer = NsfHifiGAN(enhancer_ckpt, device=self.device)
+ else:
+ raise ValueError(f" [x] Unknown enhancer: {enhancer_type}")
+
+ self.resample_kernel = {}
+ self.enhancer_sample_rate = self.enhancer.sample_rate()
+ self.enhancer_hop_size = self.enhancer.hop_size()
+
+ def enhance(self,
+ audio, # 1, T
+ sample_rate,
+ f0, # 1, n_frames, 1
+ hop_size,
+ adaptive_key = 0,
+ silence_front = 0
+ ):
+ # enhancer start time
+ start_frame = int(silence_front * sample_rate / hop_size)
+ real_silence_front = start_frame * hop_size / sample_rate
+ audio = audio[:, int(np.round(real_silence_front * sample_rate)) : ]
+ f0 = f0[: , start_frame :, :]
+
+ # adaptive parameters
+ adaptive_factor = 2 ** ( -adaptive_key / 12)
+ adaptive_sample_rate = 100 * int(np.round(self.enhancer_sample_rate / adaptive_factor / 100))
+ real_factor = self.enhancer_sample_rate / adaptive_sample_rate
+
+ # resample the input audio
+ if sample_rate == adaptive_sample_rate:
+ audio_res = audio
+ else:
+ key_str = str(sample_rate) + str(adaptive_sample_rate)
+ if key_str not in self.resample_kernel:
+ self.resample_kernel[key_str] = Resample(sample_rate, adaptive_sample_rate, lowpass_filter_width = 128).to(self.device)
+ audio_res = self.resample_kernel[key_str](audio)
+
+ n_frames = int(audio_res.size(-1) // self.enhancer_hop_size + 1)
+
+ # resample f0
+ f0_np = f0.squeeze(0).squeeze(-1).cpu().numpy()
+ f0_np = f0_np * real_factor # out-of-place: f0_np may share memory with the caller's f0 tensor
+ time_org = (hop_size / sample_rate) * np.arange(len(f0_np)) / real_factor
+ time_frame = (self.enhancer_hop_size / self.enhancer_sample_rate) * np.arange(n_frames)
+ f0_res = np.interp(time_frame, time_org, f0_np, left=f0_np[0], right=f0_np[-1])
+ f0_res = torch.from_numpy(f0_res).unsqueeze(0).float().to(self.device) # 1, n_frames
+
+ # enhance
+ enhanced_audio, enhancer_sample_rate = self.enhancer(audio_res, f0_res)
+
+ # resample the enhanced output
+ if adaptive_sample_rate != enhancer_sample_rate:  # skip the resampler when the rates already match
+ key_str = str(adaptive_sample_rate) + str(enhancer_sample_rate)
+ if key_str not in self.resample_kernel:
+ self.resample_kernel[key_str] = Resample(adaptive_sample_rate, enhancer_sample_rate, lowpass_filter_width = 128).to(self.device)
+ enhanced_audio = self.resample_kernel[key_str](enhanced_audio)
+
+ # pad the silence frames
+ if start_frame > 0:
+ enhanced_audio = F.pad(enhanced_audio, (int(np.round(enhancer_sample_rate * real_silence_front)), 0))
+
+ return enhanced_audio, enhancer_sample_rate
+
+
+class NsfHifiGAN(torch.nn.Module):
+ def __init__(self, model_path, device=None):
+ super().__init__()
+ if device is None:
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ self.device = device
+ print('| Load HifiGAN: ', model_path)
+ self.model, self.h = load_model(model_path, device=self.device)
+
+ def sample_rate(self):
+ return self.h.sampling_rate
+
+ def hop_size(self):
+ return self.h.hop_size
+
+ def forward(self, audio, f0):
+ stft = STFT(
+ self.h.sampling_rate,
+ self.h.num_mels,
+ self.h.n_fft,
+ self.h.win_size,
+ self.h.hop_size,
+ self.h.fmin,
+ self.h.fmax)
+ with torch.no_grad():
+ mel = stft.get_mel(audio)
+ enhanced_audio = self.model(mel, f0[:,:mel.size(-1)]).view(-1)
+ return enhanced_audio, self.h.sampling_rate
\ No newline at end of file
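
The adaptive-key arithmetic in `Enhancer.enhance` can be checked in isolation. The sketch below is a pure-Python restatement of the three lines that compute `adaptive_factor`, `adaptive_sample_rate`, and `real_factor`. The helper name `adaptive_rates` is hypothetical and not part of the patch, and it assumes Python's built-in `round` is an acceptable stand-in for `np.round` (both round halves to even).

```python
def adaptive_rates(enhancer_sample_rate, adaptive_key):
    """Return (adaptive_sample_rate, real_factor) for a key shift in semitones.

    Hypothetical helper mirroring the arithmetic in Enhancer.enhance.
    """
    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    adaptive_factor = 2 ** (-adaptive_key / 12)
    # The intermediate rate is rounded to a multiple of 100 Hz.
    adaptive_sample_rate = 100 * int(round(enhancer_sample_rate / adaptive_factor / 100))
    # Because of the rounding, the factor actually applied to f0 can differ
    # slightly from adaptive_factor.
    real_factor = enhancer_sample_rate / adaptive_sample_rate
    return adaptive_sample_rate, real_factor


# No key shift: the intermediate rate equals the enhancer rate.
print(adaptive_rates(44100, 0))
# One octave up: the audio is treated as if it were sampled twice as fast.
print(adaptive_rates(44100, 12))
```

With `adaptive_key = 0` the intermediate rate equals the enhancer rate, so the final resampling step has nothing to do.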
diff --git a/so-vits-svc/modules/losses.py b/so-vits-svc/modules/losses.py
new file mode 100644
index 0000000000000000000000000000000000000000..cd21799eccde350c3aac0bdd661baf96ed220147
--- /dev/null
+++ b/so-vits-svc/modules/losses.py
@@ -0,0 +1,61 @@
+import torch
+from torch.nn import functional as F
+
+import modules.commons as commons
+
+
+def feature_loss(fmap_r, fmap_g):
+ loss = 0
+ for dr, dg in zip(fmap_r, fmap_g):
+ for rl, gl in zip(dr, dg):
+ rl = rl.float().detach()
+ gl = gl.float()
+ loss += torch.mean(torch.abs(rl - gl))
+
+ return loss * 2
+
+
+def discriminator_loss(disc_real_outputs, disc_generated_outputs):
+ loss = 0
+ r_losses = []
+ g_losses = []
+ for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
+ dr = dr.float()
+ dg = dg.float()
+ r_loss = torch.mean((1-dr)**2)
+ g_loss = torch.mean(dg**2)
+ loss += (r_loss + g_loss)
+ r_losses.append(r_loss.item())
+ g_losses.append(g_loss.item())
+
+ return loss, r_losses, g_losses
+
+
+def generator_loss(disc_outputs):
+ loss = 0
+ gen_losses = []
+ for dg in disc_outputs:
+ dg = dg.float()
+ l = torch.mean((1-dg)**2)
+ gen_losses.append(l)
+ loss += l
+
+ return loss, gen_losses
+
+
+def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
+ """
+ z_p, logs_q: [b, h, t_t]
+ m_p, logs_p: [b, h, t_t]
+ """
+ z_p = z_p.float()
+ logs_q = logs_q.float()
+ m_p = m_p.float()
+ logs_p = logs_p.float()
+ z_mask = z_mask.float()
+ #print(logs_p)
+ kl = logs_p - logs_q - 0.5
+ kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
+ kl = torch.sum(kl * z_mask)
+ l = kl / torch.sum(z_mask)
+ return l
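
The discriminator and generator losses above follow the least-squares GAN form: real outputs are pushed towards 1 and generated outputs towards 0 for the discriminator, while the generator pushes its outputs towards 1. A minimal sketch on plain Python lists, assuming a single discriminator head (the function names are illustrative, not part of the patch):

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_scores, fake_scores):
    # mean((1 - D(real))^2) + mean(D(fake)^2), as in discriminator_loss
    real_term = mean([(1 - r) ** 2 for r in real_scores])
    fake_term = mean([f ** 2 for f in fake_scores])
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    # mean((1 - D(fake))^2), as in generator_loss
    return mean([(1 - f) ** 2 for f in fake_scores])


# A discriminator that separates real from fake perfectly has zero loss,
# and the generator loss is then at 1.
print(lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0]))
print(lsgan_generator_loss([0.0, 0.0]))
```

In the patch the same quantities are summed over every discriminator head, which is why the functions there loop over lists of outputs.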
diff --git a/so-vits-svc/modules/mel_processing.py b/so-vits-svc/modules/mel_processing.py
new file mode 100644
index 0000000000000000000000000000000000000000..99c5b35beb83f3b288af0fac5b49ebf2c69f062c
--- /dev/null
+++ b/so-vits-svc/modules/mel_processing.py
@@ -0,0 +1,112 @@
+import math
+import os
+import random
+import torch
+from torch import nn
+import torch.nn.functional as F
+import torch.utils.data
+import numpy as np
+import librosa
+import librosa.util as librosa_util
+from librosa.util import normalize, pad_center, tiny
+from scipy.signal import get_window
+from scipy.io.wavfile import read
+from librosa.filters import mel as librosa_mel_fn
+
+MAX_WAV_VALUE = 32768.0
+
+
+def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
+ """
+ PARAMS
+ ------
+ C: compression factor
+ """
+ return torch.log(torch.clamp(x, min=clip_val) * C)
+
+
+def dynamic_range_decompression_torch(x, C=1):
+ """
+ PARAMS
+ ------
+ C: compression factor used to compress
+ """
+ return torch.exp(x) / C
+
+
+def spectral_normalize_torch(magnitudes):
+ output = dynamic_range_compression_torch(magnitudes)
+ return output
+
+
+def spectral_de_normalize_torch(magnitudes):
+ output = dynamic_range_decompression_torch(magnitudes)
+ return output
+
+
+mel_basis = {}
+hann_window = {}
+
+
+def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False):
+ if torch.min(y) < -1.:
+ print('min value is ', torch.min(y))
+ if torch.max(y) > 1.:
+ print('max value is ', torch.max(y))
+
+ global hann_window
+ dtype_device = str(y.dtype) + '_' + str(y.device)
+ wnsize_dtype_device = str(win_size) + '_' + dtype_device
+ if wnsize_dtype_device not in hann_window:
+ hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device)
+
+ y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
+ y = y.squeeze(1)
+
+ spec = torch.view_as_real(torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
+ center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=True))
+
+ spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
+ return spec
+
+
+def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax):
+ global mel_basis
+ dtype_device = str(spec.dtype) + '_' + str(spec.device)
+ fmax_dtype_device = str(fmax) + '_' + dtype_device
+ if fmax_dtype_device not in mel_basis:
+ mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
+ mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=spec.dtype, device=spec.device)
+ spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
+ spec = spectral_normalize_torch(spec)
+ return spec
+
+
+def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False):
+ if torch.min(y) < -1.:
+ print('min value is ', torch.min(y))
+ if torch.max(y) > 1.:
+ print('max value is ', torch.max(y))
+
+ global mel_basis, hann_window
+ dtype_device = str(y.dtype) + '_' + str(y.device)
+ fmax_dtype_device = str(fmax) + '_' + dtype_device
+ wnsize_dtype_device = str(win_size) + '_' + dtype_device
+ if fmax_dtype_device not in mel_basis:
+ mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
+ mel_basis[fmax_dtype_device] = torch.from_numpy(mel).to(dtype=y.dtype, device=y.device)
+ if wnsize_dtype_device not in hann_window:
+ hann_window[wnsize_dtype_device] = torch.hann_window(win_size).to(dtype=y.dtype, device=y.device)
+
+ y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
+ y = y.squeeze(1)
+
+ spec = torch.view_as_real(torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
+ center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=True))
+
+ spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
+
+ spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
+ spec = spectral_normalize_torch(spec)
+
+ return spec
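
Both spectrogram functions pad the signal by `(n_fft - hop_size) / 2` samples on each side and then call `torch.stft` with `center=False`. The sketch below works through the resulting frame count, assuming `n_fft - hop_size` is even and the signal is longer than the padding (reflect padding requires that). The helper name `padded_frame_count` is hypothetical.

```python
def padded_frame_count(num_samples, n_fft, hop_size):
    """Frames produced by the manual reflect pad followed by an uncentred STFT."""
    pad = int((n_fft - hop_size) / 2)   # same expression as in the patch
    padded = num_samples + 2 * pad      # length after padding both sides
    # An STFT without centring yields 1 + (length - n_fft) // hop frames.
    return 1 + (padded - n_fft) // hop_size


# One second of 44.1 kHz audio with n_fft=2048 and hop_size=512.
frames = padded_frame_count(44100, 2048, 512)
print(frames, 44100 // 512)
```

Under those assumptions the count reduces to `num_samples // hop_size`, so each frame lines up with one hop of audio.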
diff --git a/so-vits-svc/modules/modules.py b/so-vits-svc/modules/modules.py
new file mode 100644
index 0000000000000000000000000000000000000000..54290fd207b25e93831bd21005990ea137e6b50e
--- /dev/null
+++ b/so-vits-svc/modules/modules.py
@@ -0,0 +1,342 @@
+import copy
+import math
+import numpy as np
+import scipy
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
+from torch.nn.utils import weight_norm, remove_weight_norm
+
+import modules.commons as commons
+from modules.commons import init_weights, get_padding
+
+
+LRELU_SLOPE = 0.1
+
+
+class LayerNorm(nn.Module):
+ def __init__(self, channels, eps=1e-5):
+ super().__init__()
+ self.channels = channels
+ self.eps = eps
+
+ self.gamma = nn.Parameter(torch.ones(channels))
+ self.beta = nn.Parameter(torch.zeros(channels))
+
+ def forward(self, x):
+ x = x.transpose(1, -1)
+ x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps)
+ return x.transpose(1, -1)
+
+
+class ConvReluNorm(nn.Module):
+ def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout):
+ super().__init__()
+ self.in_channels = in_channels
+ self.hidden_channels = hidden_channels
+ self.out_channels = out_channels
+ self.kernel_size = kernel_size
+ self.n_layers = n_layers
+ self.p_dropout = p_dropout
+ assert n_layers > 1, "Number of layers should be larger than 1."
+
+ self.conv_layers = nn.ModuleList()
+ self.norm_layers = nn.ModuleList()
+ self.conv_layers.append(nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size//2))
+ self.norm_layers.append(LayerNorm(hidden_channels))
+ self.relu_drop = nn.Sequential(
+ nn.ReLU(),
+ nn.Dropout(p_dropout))
+ for _ in range(n_layers-1):
+ self.conv_layers.append(nn.Conv1d(hidden_channels, hidden_channels, kernel_size, padding=kernel_size//2))
+ self.norm_layers.append(LayerNorm(hidden_channels))
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
+ self.proj.weight.data.zero_()
+ self.proj.bias.data.zero_()
+
+ def forward(self, x, x_mask):
+ x_org = x
+ for i in range(self.n_layers):
+ x = self.conv_layers[i](x * x_mask)
+ x = self.norm_layers[i](x)
+ x = self.relu_drop(x)
+ x = x_org + self.proj(x)
+ return x * x_mask
+
+
+class DDSConv(nn.Module):
+ """
+ Dilated and Depth-Separable Convolution
+ """
+ def __init__(self, channels, kernel_size, n_layers, p_dropout=0.):
+ super().__init__()
+ self.channels = channels
+ self.kernel_size = kernel_size
+ self.n_layers = n_layers
+ self.p_dropout = p_dropout
+
+ self.drop = nn.Dropout(p_dropout)
+ self.convs_sep = nn.ModuleList()
+ self.convs_1x1 = nn.ModuleList()
+ self.norms_1 = nn.ModuleList()
+ self.norms_2 = nn.ModuleList()
+ for i in range(n_layers):
+ dilation = kernel_size ** i
+ padding = (kernel_size * dilation - dilation) // 2
+ self.convs_sep.append(nn.Conv1d(channels, channels, kernel_size,
+ groups=channels, dilation=dilation, padding=padding
+ ))
+ self.convs_1x1.append(nn.Conv1d(channels, channels, 1))
+ self.norms_1.append(LayerNorm(channels))
+ self.norms_2.append(LayerNorm(channels))
+
+ def forward(self, x, x_mask, g=None):
+ if g is not None:
+ x = x + g
+ for i in range(self.n_layers):
+ y = self.convs_sep[i](x * x_mask)
+ y = self.norms_1[i](y)
+ y = F.gelu(y)
+ y = self.convs_1x1[i](y)
+ y = self.norms_2[i](y)
+ y = F.gelu(y)
+ y = self.drop(y)
+ x = x + y
+ return x * x_mask
+
+
+class WN(torch.nn.Module):
+ def __init__(self, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=0, p_dropout=0):
+ super(WN, self).__init__()
+ assert(kernel_size % 2 == 1)
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.dilation_rate = dilation_rate
+ self.n_layers = n_layers
+ self.gin_channels = gin_channels
+ self.p_dropout = p_dropout
+
+ self.in_layers = torch.nn.ModuleList()
+ self.res_skip_layers = torch.nn.ModuleList()
+ self.drop = nn.Dropout(p_dropout)
+
+ if gin_channels != 0:
+ cond_layer = torch.nn.Conv1d(gin_channels, 2*hidden_channels*n_layers, 1)
+ self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
+
+ for i in range(n_layers):
+ dilation = dilation_rate ** i
+ padding = int((kernel_size * dilation - dilation) / 2)
+ in_layer = torch.nn.Conv1d(hidden_channels, 2*hidden_channels, kernel_size,
+ dilation=dilation, padding=padding)
+ in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
+ self.in_layers.append(in_layer)
+
+ # the last layer feeds only the skip output, so it needs no residual channels
+ if i < n_layers - 1:
+ res_skip_channels = 2 * hidden_channels
+ else:
+ res_skip_channels = hidden_channels
+
+ res_skip_layer = torch.nn.Conv1d(hidden_channels, res_skip_channels, 1)
+ res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
+ self.res_skip_layers.append(res_skip_layer)
+
+ def forward(self, x, x_mask, g=None, **kwargs):
+ output = torch.zeros_like(x)
+ n_channels_tensor = torch.IntTensor([self.hidden_channels])
+
+ if g is not None:
+ g = self.cond_layer(g)
+
+ for i in range(self.n_layers):
+ x_in = self.in_layers[i](x)
+ if g is not None:
+ cond_offset = i * 2 * self.hidden_channels
+ g_l = g[:,cond_offset:cond_offset+2*self.hidden_channels,:]
+ else:
+ g_l = torch.zeros_like(x_in)
+
+ acts = commons.fused_add_tanh_sigmoid_multiply(
+ x_in,
+ g_l,
+ n_channels_tensor)
+ acts = self.drop(acts)
+
+ res_skip_acts = self.res_skip_layers[i](acts)
+ if i < self.n_layers - 1:
+ res_acts = res_skip_acts[:,:self.hidden_channels,:]
+ x = (x + res_acts) * x_mask
+ output = output + res_skip_acts[:,self.hidden_channels:,:]
+ else:
+ output = output + res_skip_acts
+ return output * x_mask
+
+ def remove_weight_norm(self):
+ if self.gin_channels != 0:
+ torch.nn.utils.remove_weight_norm(self.cond_layer)
+ for l in self.in_layers:
+ torch.nn.utils.remove_weight_norm(l)
+ for l in self.res_skip_layers:
+ torch.nn.utils.remove_weight_norm(l)
+
+
+class ResBlock1(torch.nn.Module):
+ def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)):
+ super(ResBlock1, self).__init__()
+ self.convs1 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=get_padding(kernel_size, dilation[1]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
+ padding=get_padding(kernel_size, dilation[2])))
+ ])
+ self.convs1.apply(init_weights)
+
+ self.convs2 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1)))
+ ])
+ self.convs2.apply(init_weights)
+
+ def forward(self, x, x_mask=None):
+ for c1, c2 in zip(self.convs1, self.convs2):
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ if x_mask is not None:
+ xt = xt * x_mask
+ xt = c1(xt)
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
+ if x_mask is not None:
+ xt = xt * x_mask
+ xt = c2(xt)
+ x = xt + x
+ if x_mask is not None:
+ x = x * x_mask
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs1:
+ remove_weight_norm(l)
+ for l in self.convs2:
+ remove_weight_norm(l)
+
+
+class ResBlock2(torch.nn.Module):
+ def __init__(self, channels, kernel_size=3, dilation=(1, 3)):
+ super(ResBlock2, self).__init__()
+ self.convs = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=get_padding(kernel_size, dilation[1])))
+ ])
+ self.convs.apply(init_weights)
+
+ def forward(self, x, x_mask=None):
+ for c in self.convs:
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ if x_mask is not None:
+ xt = xt * x_mask
+ xt = c(xt)
+ x = xt + x
+ if x_mask is not None:
+ x = x * x_mask
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs:
+ remove_weight_norm(l)
+
+
+class Log(nn.Module):
+ def forward(self, x, x_mask, reverse=False, **kwargs):
+ if not reverse:
+ y = torch.log(torch.clamp_min(x, 1e-5)) * x_mask
+ logdet = torch.sum(-y, [1, 2])
+ return y, logdet
+ else:
+ x = torch.exp(x) * x_mask
+ return x
+
+
+class Flip(nn.Module):
+ def forward(self, x, *args, reverse=False, **kwargs):
+ x = torch.flip(x, [1])
+ if not reverse:
+ logdet = torch.zeros(x.size(0)).to(dtype=x.dtype, device=x.device)
+ return x, logdet
+ else:
+ return x
+
+
+class ElementwiseAffine(nn.Module):
+ def __init__(self, channels):
+ super().__init__()
+ self.channels = channels
+ self.m = nn.Parameter(torch.zeros(channels,1))
+ self.logs = nn.Parameter(torch.zeros(channels,1))
+
+ def forward(self, x, x_mask, reverse=False, **kwargs):
+ if not reverse:
+ y = self.m + torch.exp(self.logs) * x
+ y = y * x_mask
+ logdet = torch.sum(self.logs * x_mask, [1,2])
+ return y, logdet
+ else:
+ x = (x - self.m) * torch.exp(-self.logs) * x_mask
+ return x
+
+
+class ResidualCouplingLayer(nn.Module):
+ def __init__(self,
+ channels,
+ hidden_channels,
+ kernel_size,
+ dilation_rate,
+ n_layers,
+ p_dropout=0,
+ gin_channels=0,
+ mean_only=False):
+ assert channels % 2 == 0, "channels should be divisible by 2"
+ super().__init__()
+ self.channels = channels
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.dilation_rate = dilation_rate
+ self.n_layers = n_layers
+ self.half_channels = channels // 2
+ self.mean_only = mean_only
+
+ self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1)
+ self.enc = WN(hidden_channels, kernel_size, dilation_rate, n_layers, p_dropout=p_dropout, gin_channels=gin_channels)
+ self.post = nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1)
+ self.post.weight.data.zero_()
+ self.post.bias.data.zero_()
+
+ def forward(self, x, x_mask, g=None, reverse=False):
+ x0, x1 = torch.split(x, [self.half_channels]*2, 1)
+ h = self.pre(x0) * x_mask
+ h = self.enc(h, x_mask, g=g)
+ stats = self.post(h) * x_mask
+ if not self.mean_only:
+ m, logs = torch.split(stats, [self.half_channels]*2, 1)
+ else:
+ m = stats
+ logs = torch.zeros_like(m)
+
+ if not reverse:
+ x1 = m + x1 * torch.exp(logs) * x_mask
+ x = torch.cat([x0, x1], 1)
+ logdet = torch.sum(logs, [1,2])
+ return x, logdet
+ else:
+ x1 = (x1 - m) * torch.exp(-logs) * x_mask
+ x = torch.cat([x0, x1], 1)
+ return x
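
`ResidualCouplingLayer` is invertible because the shift `m` and log-scale `logs` depend only on the untouched half `x0`. A minimal sketch of that property on plain Python lists follows; the functions `shift_fn` and `log_scale_fn` are toy stand-ins for the `pre`/`enc`/`post` stack, and masking is omitted.

```python
import math


def coupling_forward(x0, x1, shift_fn, log_scale_fn):
    # m and logs are computed from x0 only, so the reverse pass can recompute them.
    m, logs = shift_fn(x0), log_scale_fn(x0)
    y1 = [mi + xi * math.exp(li) for mi, xi, li in zip(m, x1, logs)]
    logdet = sum(logs)  # log-determinant of the Jacobian
    return x0, y1, logdet


def coupling_reverse(x0, y1, shift_fn, log_scale_fn):
    m, logs = shift_fn(x0), log_scale_fn(x0)
    x1 = [(yi - mi) * math.exp(-li) for mi, yi, li in zip(m, y1, logs)]
    return x0, x1


# Toy conditioning functions (illustrative only).
toy_shift = lambda x0: [2.0 * v + 1.0 for v in x0]
toy_log_scale = lambda x0: [0.1 * v for v in x0]

x0, x1 = [0.5, -1.0], [0.3, 2.0]
_, y1, logdet = coupling_forward(x0, x1, toy_shift, toy_log_scale)
_, x1_back = coupling_reverse(x0, y1, toy_shift, toy_log_scale)
print(x1_back, logdet)
```

With `mean_only=True`, as used by `ResidualCouplingBlock`, `logs` is all zeros, so the layer reduces to a pure shift and the log-determinant is zero.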
diff --git a/so-vits-svc/onnx_export.py b/so-vits-svc/onnx_export.py
new file mode 100644
index 0000000000000000000000000000000000000000..e392523c54d8a2f264924fe9db1b15c72f2222a3
--- /dev/null
+++ b/so-vits-svc/onnx_export.py
@@ -0,0 +1,56 @@
+import torch
+from onnxexport.model_onnx import SynthesizerTrn
+import utils
+
+def main(NetExport):
+ path = "SoVits4.0"
+ if NetExport:
+ device = torch.device("cpu")
+ hps = utils.get_hparams_from_file(f"checkpoints/{path}/config.json")
+ SVCVITS = SynthesizerTrn(
+ hps.data.filter_length // 2 + 1,
+ hps.train.segment_size // hps.data.hop_length,
+ **hps.model)
+ _ = utils.load_checkpoint(f"checkpoints/{path}/model.pth", SVCVITS, None)
+ _ = SVCVITS.eval().to(device)
+ for i in SVCVITS.parameters():
+ i.requires_grad = False
+
+ n_frame = 10
+ hidden_channels = 256 #(Hubert's shape[2])
+
+ test_hidden_unit = torch.rand(1, n_frame, hidden_channels)
+ test_pitch = torch.rand(1, n_frame)
+ test_mel2ph = torch.arange(0, n_frame, dtype=torch.int64)[None] # torch.LongTensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).unsqueeze(0)
+ test_uv = torch.ones(1, n_frame, dtype=torch.float32)
+ test_noise = torch.randn(1, 192, n_frame)
+ test_sid = torch.LongTensor([0])
+ input_names = ["c", "f0", "mel2ph", "uv", "noise", "sid"]
+ output_names = ["audio", ]
+
+ torch.onnx.export(SVCVITS,
+ (
+ test_hidden_unit.to(device),
+ test_pitch.to(device),
+ test_mel2ph.to(device),
+ test_uv.to(device),
+ test_noise.to(device),
+ test_sid.to(device)
+ ),
+ f"checkpoints/{path}/model.onnx",
+ dynamic_axes={
+ "c": [0, 1],
+ "f0": [1],
+ "mel2ph": [1],
+ "uv": [1],
+ "noise": [2],
+ },
+ do_constant_folding=False,
+ opset_version=16,
+ verbose=False,
+ input_names=input_names,
+ output_names=output_names)
+
+
+if __name__ == '__main__':
+ main(True)
diff --git a/so-vits-svc/onnxexport/model_onnx.py b/so-vits-svc/onnxexport/model_onnx.py
new file mode 100644
index 0000000000000000000000000000000000000000..e28bae95ec1e53aa05d06fc784ff86d55f228d60
--- /dev/null
+++ b/so-vits-svc/onnxexport/model_onnx.py
@@ -0,0 +1,335 @@
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+import modules.attentions as attentions
+import modules.commons as commons
+import modules.modules as modules
+
+from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
+from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
+
+import utils
+from modules.commons import init_weights, get_padding
+from vdecoder.hifigan.models import Generator
+from utils import f0_to_coarse
+
+
+class ResidualCouplingBlock(nn.Module):
+ def __init__(self,
+ channels,
+ hidden_channels,
+ kernel_size,
+ dilation_rate,
+ n_layers,
+ n_flows=4,
+ gin_channels=0):
+ super().__init__()
+ self.channels = channels
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.dilation_rate = dilation_rate
+ self.n_layers = n_layers
+ self.n_flows = n_flows
+ self.gin_channels = gin_channels
+
+ self.flows = nn.ModuleList()
+ for i in range(n_flows):
+ self.flows.append(
+ modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers,
+ gin_channels=gin_channels, mean_only=True))
+ self.flows.append(modules.Flip())
+
+ def forward(self, x, x_mask, g=None, reverse=False):
+ if not reverse:
+ for flow in self.flows:
+ x, _ = flow(x, x_mask, g=g, reverse=reverse)
+ else:
+ for flow in reversed(self.flows):
+ x = flow(x, x_mask, g=g, reverse=reverse)
+ return x
+
+
+class Encoder(nn.Module):
+ def __init__(self,
+ in_channels,
+ out_channels,
+ hidden_channels,
+ kernel_size,
+ dilation_rate,
+ n_layers,
+ gin_channels=0):
+ super().__init__()
+ self.in_channels = in_channels
+ self.out_channels = out_channels
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.dilation_rate = dilation_rate
+ self.n_layers = n_layers
+ self.gin_channels = gin_channels
+
+ self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
+ self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
+
+ def forward(self, x, x_lengths, g=None):
+ # print(x.shape,x_lengths.shape)
+ x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
+ x = self.pre(x) * x_mask
+ x = self.enc(x, x_mask, g=g)
+ stats = self.proj(x) * x_mask
+ m, logs = torch.split(stats, self.out_channels, dim=1)
+ z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
+ return z, m, logs, x_mask
+
+
+class TextEncoder(nn.Module):
+ def __init__(self,
+ out_channels,
+ hidden_channels,
+ kernel_size,
+ n_layers,
+ gin_channels=0,
+ filter_channels=None,
+ n_heads=None,
+ p_dropout=None):
+ super().__init__()
+ self.out_channels = out_channels
+ self.hidden_channels = hidden_channels
+ self.kernel_size = kernel_size
+ self.n_layers = n_layers
+ self.gin_channels = gin_channels
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
+ self.f0_emb = nn.Embedding(256, hidden_channels)
+
+ self.enc_ = attentions.Encoder(
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout)
+
+ def forward(self, x, x_mask, f0=None, z=None):
+ x = x + self.f0_emb(f0).transpose(1, 2)
+ x = self.enc_(x * x_mask, x_mask)
+ stats = self.proj(x) * x_mask
+ m, logs = torch.split(stats, self.out_channels, dim=1)
+ z = (m + z * torch.exp(logs)) * x_mask
+ return z, m, logs, x_mask
+
+
+class DiscriminatorP(torch.nn.Module):
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
+ super(DiscriminatorP, self).__init__()
+ self.period = period
+ self.use_spectral_norm = use_spectral_norm
+ norm_f = spectral_norm if use_spectral_norm else weight_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
+ norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(get_padding(kernel_size, 1), 0))),
+ ])
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
+
+ def forward(self, x):
+ fmap = []
+
+ # 1d to 2d
+ b, c, t = x.shape
+ if t % self.period != 0: # pad first
+ n_pad = self.period - (t % self.period)
+ x = F.pad(x, (0, n_pad), "reflect")
+ t = t + n_pad
+ x = x.view(b, c, t // self.period, self.period)
+
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class DiscriminatorS(torch.nn.Module):
+ def __init__(self, use_spectral_norm=False):
+ super(DiscriminatorS, self).__init__()
+ norm_f = spectral_norm if use_spectral_norm else weight_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv1d(1, 16, 15, 1, padding=7)),
+ norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
+ norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
+ norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
+ ])
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
+
+ def forward(self, x):
+ fmap = []
+
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class F0Decoder(nn.Module):
+ def __init__(self,
+ out_channels,
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout,
+ spk_channels=0):
+ super().__init__()
+ self.out_channels = out_channels
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.spk_channels = spk_channels
+
+ self.prenet = nn.Conv1d(hidden_channels, hidden_channels, 3, padding=1)
+ self.decoder = attentions.FFT(
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout)
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
+ self.f0_prenet = nn.Conv1d(1, hidden_channels, 3, padding=1)
+ self.cond = nn.Conv1d(spk_channels, hidden_channels, 1)
+
+ def forward(self, x, norm_f0, x_mask, spk_emb=None):
+ x = torch.detach(x)
+ if spk_emb is not None:
+ x = x + self.cond(spk_emb)
+ x += self.f0_prenet(norm_f0)
+ x = self.prenet(x) * x_mask
+ x = self.decoder(x * x_mask, x_mask)
+ x = self.proj(x) * x_mask
+ return x
+
+
+class SynthesizerTrn(nn.Module):
+ """
+ Synthesizer for Training
+ """
+
+ def __init__(self,
+ spec_channels,
+ segment_size,
+ inter_channels,
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout,
+ resblock,
+ resblock_kernel_sizes,
+ resblock_dilation_sizes,
+ upsample_rates,
+ upsample_initial_channel,
+ upsample_kernel_sizes,
+ gin_channels,
+ ssl_dim,
+ n_speakers,
+ sampling_rate=44100,
+ **kwargs):
+ super().__init__()
+ self.spec_channels = spec_channels
+ self.inter_channels = inter_channels
+ self.hidden_channels = hidden_channels
+ self.filter_channels = filter_channels
+ self.n_heads = n_heads
+ self.n_layers = n_layers
+ self.kernel_size = kernel_size
+ self.p_dropout = p_dropout
+ self.resblock = resblock
+ self.resblock_kernel_sizes = resblock_kernel_sizes
+ self.resblock_dilation_sizes = resblock_dilation_sizes
+ self.upsample_rates = upsample_rates
+ self.upsample_initial_channel = upsample_initial_channel
+ self.upsample_kernel_sizes = upsample_kernel_sizes
+ self.segment_size = segment_size
+ self.gin_channels = gin_channels
+ self.ssl_dim = ssl_dim
+ self.emb_g = nn.Embedding(n_speakers, gin_channels)
+
+ self.pre = nn.Conv1d(ssl_dim, hidden_channels, kernel_size=5, padding=2)
+
+ self.enc_p = TextEncoder(
+ inter_channels,
+ hidden_channels,
+ filter_channels=filter_channels,
+ n_heads=n_heads,
+ n_layers=n_layers,
+ kernel_size=kernel_size,
+ p_dropout=p_dropout
+ )
+ hps = {
+ "sampling_rate": sampling_rate,
+ "inter_channels": inter_channels,
+ "resblock": resblock,
+ "resblock_kernel_sizes": resblock_kernel_sizes,
+ "resblock_dilation_sizes": resblock_dilation_sizes,
+ "upsample_rates": upsample_rates,
+ "upsample_initial_channel": upsample_initial_channel,
+ "upsample_kernel_sizes": upsample_kernel_sizes,
+ "gin_channels": gin_channels,
+ }
+ self.dec = Generator(h=hps)
+ self.enc_q = Encoder(spec_channels, inter_channels, hidden_channels, 5, 1, 16, gin_channels=gin_channels)
+ self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
+ self.f0_decoder = F0Decoder(
+ 1,
+ hidden_channels,
+ filter_channels,
+ n_heads,
+ n_layers,
+ kernel_size,
+ p_dropout,
+ spk_channels=gin_channels
+ )
+ self.emb_uv = nn.Embedding(2, hidden_channels)
+ self.predict_f0 = False
+
+ def forward(self, c, f0, mel2ph, uv, noise=None, g=None):
+
+ decoder_inp = F.pad(c, [0, 0, 1, 0])
+ mel2ph_ = mel2ph.unsqueeze(2).repeat([1, 1, c.shape[-1]])
+ c = torch.gather(decoder_inp, 1, mel2ph_).transpose(1, 2) # [B, T, H]
+
+ c_lengths = (torch.ones(c.size(0)) * c.size(-1)).to(c.device)
+ g = g.unsqueeze(0)
+ g = self.emb_g(g).transpose(1, 2)
+ x_mask = torch.unsqueeze(commons.sequence_mask(c_lengths, c.size(2)), 1).to(c.dtype)
+ x = self.pre(c) * x_mask + self.emb_uv(uv.long()).transpose(1, 2)
+
+ if self.predict_f0:
+ lf0 = 2595. * torch.log10(1. + f0.unsqueeze(1) / 700.) / 500
+ norm_lf0 = utils.normalize_f0(lf0, x_mask, uv, random_scale=False)
+ pred_lf0 = self.f0_decoder(x, norm_lf0, x_mask, spk_emb=g)
+ f0 = (700 * (torch.pow(10, pred_lf0 * 500 / 2595) - 1)).squeeze(1)
+
+ z_p, m_p, logs_p, c_mask = self.enc_p(x, x_mask, f0=f0_to_coarse(f0), z=noise)
+ z = self.flow(z_p, c_mask, g=g, reverse=True)
+ o = self.dec(z * c_mask, g=g, f0=f0)
+ return o
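
When `predict_f0` is enabled, `SynthesizerTrn.forward` maps f0 in Hz to a scaled mel value, `2595 * log10(1 + f0 / 700) / 500`, and maps the decoder output back with the inverse expression. A small scalar sketch of that pair, using hypothetical helper names and the standard library only:

```python
import math


def f0_to_lf0(f0_hz):
    # Scaled mel value, as computed before normalize_f0 in the forward pass.
    return 2595.0 * math.log10(1.0 + f0_hz / 700.0) / 500.0


def lf0_to_f0(lf0):
    # Inverse mapping, applied to the F0 decoder's prediction.
    return 700.0 * (10.0 ** (lf0 * 500.0 / 2595.0) - 1.0)


lf0 = f0_to_lf0(440.0)
print(lf0, lf0_to_f0(lf0))
```

The two expressions are exact inverses, so any difference between input and output f0 comes from the decoder's prediction and not from the conversion.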
diff --git a/so-vits-svc/preprocess_flist_config.py b/so-vits-svc/preprocess_flist_config.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac946865d42801fb8e710973f0af6788e47ff3a0
--- /dev/null
+++ b/so-vits-svc/preprocess_flist_config.py
@@ -0,0 +1,75 @@
+import os
+import argparse
+import re
+
+from tqdm import tqdm
+from random import shuffle
+import json
+import wave
+
+config_template = json.load(open("configs_template/config_template.json"))
+
+pattern = re.compile(r'^[\.a-zA-Z0-9_\/]+$')
+
+def get_wav_duration(file_path):
+ with wave.open(file_path, 'rb') as wav_file:
+ # get audio frames
+ n_frames = wav_file.getnframes()
+ # get sampling rate
+ framerate = wav_file.getframerate()
+ # calculate duration in seconds
+ duration = n_frames / float(framerate)
+ return duration
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--train_list", type=str, default="./filelists/train.txt", help="path to train list")
+ parser.add_argument("--val_list", type=str, default="./filelists/val.txt", help="path to val list")
+ parser.add_argument("--source_dir", type=str, default="./dataset/44k", help="path to source dir")
+ args = parser.parse_args()
+
+ train = []
+ val = []
+ idx = 0
+ spk_dict = {}
+ spk_id = 0
+ for speaker in tqdm(os.listdir(args.source_dir)):
+ spk_dict[speaker] = spk_id
+ spk_id += 1
+ wavs = ["/".join([args.source_dir, speaker, i]) for i in os.listdir(os.path.join(args.source_dir, speaker))]
+ new_wavs = []
+ for file in wavs:
+ if not file.endswith(".wav"):
+ continue
+ if not pattern.match(file):
+ print(f"Warning: the path {file} contains characters other than letters, digits, '.', '_' and '/', which may cause issues.")
+ if get_wav_duration(file) < 0.3:
+ print("skip too short audio:", file)
+ continue
+ new_wavs.append(file)
+ wavs = new_wavs
+ shuffle(wavs)
+ train += wavs[2:]
+ val += wavs[:2]
+
+ shuffle(train)
+ shuffle(val)
+
+ print("Writing", args.train_list)
+ with open(args.train_list, "w") as f:
+ for fname in tqdm(train):
+ wavpath = fname
+ f.write(wavpath + "\n")
+
+ print("Writing", args.val_list)
+ with open(args.val_list, "w") as f:
+ for fname in tqdm(val):
+ wavpath = fname
+ f.write(wavpath + "\n")
+
+ config_template["spk"] = spk_dict
+ config_template["model"]["n_speakers"] = spk_id
+
+ print("Writing configs/config.json")
+ with open("configs/config.json", "w") as f:
+ json.dump(config_template, f, indent=2)
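
The per-speaker split in this script shuffles the files and sends the first two to validation and the rest to training. A minimal sketch of that rule follows, with a hypothetical helper `split_speaker_files` and an optional seed added purely to make the example reproducible (the script itself does not seed the shuffle).

```python
import random


def split_speaker_files(wavs, n_val=2, seed=None):
    """Return (train, val): the first n_val shuffled files go to validation."""
    files = list(wavs)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(files)
    return files[n_val:], files[:n_val]


train, val = split_speaker_files([f"spk/{i}.wav" for i in range(5)], seed=0)
print(len(train), len(val))

# A speaker with only two usable files contributes nothing to training.
few_train, few_val = split_speaker_files(["spk/a.wav", "spk/b.wav"], seed=0)
print(few_train, len(few_val))
```

The second call shows a consequence worth checking in a dataset: a speaker with two or fewer usable clips ends up in the validation list only.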
diff --git a/so-vits-svc/preprocess_hubert_f0.py b/so-vits-svc/preprocess_hubert_f0.py
new file mode 100644
index 0000000000000000000000000000000000000000..763fb0d65540ed4d62b269914e81c740f3ff6bba
--- /dev/null
+++ b/so-vits-svc/preprocess_hubert_f0.py
@@ -0,0 +1,101 @@
+import math
+import multiprocessing
+import os
+import argparse
+from random import shuffle
+
+import torch
+from glob import glob
+from tqdm import tqdm
+from modules.mel_processing import spectrogram_torch
+
+import utils
+import logging
+
+logging.getLogger("numba").setLevel(logging.WARNING)
+import librosa
+import numpy as np
+
+hps = utils.get_hparams_from_file("configs/config.json")
+sampling_rate = hps.data.sampling_rate
+hop_length = hps.data.hop_length
+
+
+def process_one(filename, hmodel):
+    # print(filename)
+    wav, sr = librosa.load(filename, sr=sampling_rate)
+    soft_path = filename + ".soft.pt"
+    if not os.path.exists(soft_path):
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        wav16k = librosa.resample(wav, orig_sr=sampling_rate, target_sr=16000)
+        wav16k = torch.from_numpy(wav16k).to(device)
+        c = utils.get_hubert_content(hmodel, wav_16k_tensor=wav16k)
+        torch.save(c.cpu(), soft_path)
+
+    f0_path = filename + ".f0.npy"
+    if not os.path.exists(f0_path):
+        f0 = utils.compute_f0_dio(
+            wav, sampling_rate=sampling_rate, hop_length=hop_length
+        )
+        np.save(f0_path, f0)
+
+    spec_path = filename.replace(".wav", ".spec.pt")
+    if not os.path.exists(spec_path):
+        # Process spectrogram
+        # The following code can't be replaced by torch.FloatTensor(wav)
+        # because load_wav_to_torch returns a tensor that needs to be normalized
+
+        audio, sr = utils.load_wav_to_torch(filename)
+        if sr != hps.data.sampling_rate:
+            raise ValueError(
+                "{} SR doesn't match target {} SR".format(
+                    sr, hps.data.sampling_rate
+                )
+            )
+
+        audio_norm = audio / hps.data.max_wav_value
+        audio_norm = audio_norm.unsqueeze(0)
+
+        spec = spectrogram_torch(
+            audio_norm,
+            hps.data.filter_length,
+            hps.data.sampling_rate,
+            hps.data.hop_length,
+            hps.data.win_length,
+            center=False,
+        )
+        spec = torch.squeeze(spec, 0)
+        torch.save(spec, spec_path)
+
+
+def process_batch(filenames):
+    print("Loading hubert for content...")
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    hmodel = utils.get_hubert_model().to(device)
+    print("Loaded hubert.")
+    for filename in tqdm(filenames):
+        process_one(filename, hmodel)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--in_dir", type=str, default="dataset/44k", help="path to input dir"
+    )
+
+    args = parser.parse_args()
+    filenames = glob(f"{args.in_dir}/*/*.wav", recursive=True)  # [:10]
+    shuffle(filenames)
+    multiprocessing.set_start_method("spawn", force=True)
+
+    num_processes = 1
+    chunk_size = int(math.ceil(len(filenames) / num_processes))
+    chunks = [
+        filenames[i : i + chunk_size] for i in range(0, len(filenames), chunk_size)
+    ]
+    print([len(c) for c in chunks])
+    processes = [
+        multiprocessing.Process(target=process_batch, args=(chunk,)) for chunk in chunks
+    ]
+    for p in processes:
+        p.start()
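Review note on the work distribution in preprocess_hubert_f0.py: the file list is cut into contiguous chunks of `ceil(len(filenames) / num_processes)` items, one chunk per worker process. The sketch below isolates that chunking. The function name `chunk_list` and the empty-input guard are assumptions added here; as written, the script appears to reach `range(0, 0, 0)` when no `.wav` files are found, and a zero step makes `range` raise `ValueError`.

```python
import math


def chunk_list(items, num_chunks):
    """Split items into at most num_chunks contiguous, near-equal chunks.

    Hypothetical helper mirroring the slicing used in the script.
    """
    if not items:
        return []  # guard: a chunk size of 0 would make range() fail
    chunk_size = math.ceil(len(items) / num_chunks)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


if __name__ == "__main__":
    # Ten files over three workers: chunk size is ceil(10 / 3) = 4.
    print([len(c) for c in chunk_list(list(range(10)), 3)])
```

With `num_processes = 1`, as in the script, this yields a single chunk holding every file, so only one HuBERT model is loaded.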
diff --git a/so-vits-svc/pretrain/nsf_hifigan/put_nsf_hifigan_ckpt_here b/so-vits-svc/pretrain/nsf_hifigan/put_nsf_hifigan_ckpt_here
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/raw/put_raw_wav_here b/so-vits-svc/raw/put_raw_wav_here
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/requirements.txt b/so-vits-svc/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..9dd41d4fd9c0ce7e66ac790a81897696655f7846
--- /dev/null
+++ b/so-vits-svc/requirements.txt
@@ -0,0 +1,21 @@
+Flask
+Flask_Cors
+gradio>=3.7.0
+numpy==1.23.0
+pyworld==0.2.5
+scipy==1.10.0
+SoundFile==0.12.1
+torch==1.13.1
+torchaudio==0.13.1
+torchcrepe
+tqdm
+scikit-maad
+praat-parselmouth
+onnx
+onnxsim
+onnxoptimizer
+fairseq==0.12.2
+librosa==0.9.1
+tensorboard
+tensorboardX
+edge_tts
diff --git a/so-vits-svc/requirements_win.txt b/so-vits-svc/requirements_win.txt
new file mode 100644
index 0000000000000000000000000000000000000000..8201f6d4b0c5b9ea37f49dc74770d48f3cdd334c
--- /dev/null
+++ b/so-vits-svc/requirements_win.txt
@@ -0,0 +1,24 @@
+librosa==0.9.1
+fairseq==0.12.2
+Flask==2.1.2
+Flask_Cors==3.0.10
+gradio>=3.7.0
+numpy
+playsound==1.3.0
+PyAudio==0.2.12
+pydub==0.25.1
+pyworld==0.3.0
+requests==2.28.1
+scipy==1.7.3
+sounddevice==0.4.5
+SoundFile==0.10.3.post1
+starlette==0.19.1
+tqdm==4.63.0
+torchcrepe
+scikit-maad
+praat-parselmouth
+onnx
+onnxsim
+onnxoptimizer
+tensorboardX
+edge_tts
diff --git a/so-vits-svc/resample.py b/so-vits-svc/resample.py
new file mode 100644
index 0000000000000000000000000000000000000000..b28a86eb779d7b3f163e89fac64ecabe044ad1e2
--- /dev/null
+++ b/so-vits-svc/resample.py
@@ -0,0 +1,48 @@
+import os
+import argparse
+import librosa
+import numpy as np
+from multiprocessing import Pool, cpu_count
+from scipy.io import wavfile
+from tqdm import tqdm
+
+
+def process(item):
+    spkdir, wav_name, args = item
+    # the speaker name is the last component of the speaker directory path
+    speaker = spkdir.replace("\\", "/").split("/")[-1]
+    wav_path = os.path.join(args.in_dir, speaker, wav_name)
+    if os.path.exists(wav_path) and '.wav' in wav_path:
+        os.makedirs(os.path.join(args.out_dir2, speaker), exist_ok=True)
+        wav, sr = librosa.load(wav_path, sr=None)
+        wav, _ = librosa.effects.trim(wav, top_db=20)
+        peak = np.abs(wav).max()
+        if peak > 1.0:
+            wav = 0.98 * wav / peak
+        wav2 = librosa.resample(wav, orig_sr=sr, target_sr=args.sr2)
+        wav2 /= max(wav2.max(), -wav2.min())
+        save_name = wav_name
+        save_path2 = os.path.join(args.out_dir2, speaker, save_name)
+        wavfile.write(
+            save_path2,
+            args.sr2,
+            (wav2 * np.iinfo(np.int16).max).astype(np.int16)
+        )
+
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--sr2", type=int, default=44100, help="sampling rate")
+    parser.add_argument("--in_dir", type=str, default="./dataset_raw", help="path to source dir")
+    parser.add_argument("--out_dir2", type=str, default="./dataset/44k", help="path to target dir")
+    args = parser.parse_args()
+    num_processes = 30 if cpu_count() > 60 else (cpu_count()-2 if cpu_count() > 4 else 1)
+    pool = Pool(processes=num_processes)
+
+    for speaker in os.listdir(args.in_dir):
+        spk_dir = os.path.join(args.in_dir, speaker)
+        if os.path.isdir(spk_dir):
+            print(spk_dir)
+            for _ in tqdm(pool.imap_unordered(process, [(spk_dir, i, args) for i in os.listdir(spk_dir) if i.endswith("wav")])):
+                pass
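Review note on the output scaling in resample.py: after resampling, the signal is divided by its largest absolute sample and multiplied by the int16 maximum (32767) before being written. The sketch below reproduces that arithmetic on a plain Python list so it runs without NumPy. The function name `peak_normalize_to_int16` and the silent-input guard are assumptions added here; the script does the same steps with NumPy arrays and has no such guard.

```python
INT16_MAX = 32767  # same value as np.iinfo(np.int16).max


def peak_normalize_to_int16(samples):
    """Scale samples so the largest magnitude maps to INT16_MAX.

    Hypothetical pure-Python stand-in for the NumPy code in the script.
    int() truncates toward zero, matching a float-to-int16 cast.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return [0] * len(samples)  # guard: avoid dividing by zero on silence
    return [int(s / peak * INT16_MAX) for s in samples]


if __name__ == "__main__":
    print(peak_normalize_to_int16([0.5, -0.25, 0.0]))
```

Because every file is scaled to full range, loudness differences between recordings are removed at this stage.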
diff --git a/so-vits-svc/sovits4_for_colab.ipynb b/so-vits-svc/sovits4_for_colab.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..aade34805c13162d0e75035cf58870d01fc3f52f
--- /dev/null
+++ b/so-vits-svc/sovits4_for_colab.ipynb
@@ -0,0 +1 @@
+{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[{"file_id":"19fxpo-ZoL_ShEUeZIZi6Di-YioWrEyhR","timestamp":1678516497580},{"file_id":"1rCUOOVG7-XQlVZuWRAj5IpGrMM8t07pE","timestamp":1673086970071},{"file_id":"1Ul5SmzWiSHBj0MaKA0B682C-RZKOycwF","timestamp":1670483515921}]},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"},"accelerator":"GPU","gpuClass":"standard"},"cells":[{"cell_type":"markdown","source":["# Terms of Use\n","\n","### Please solve the authorization problem of the dataset on your own. You shall be solely responsible for any problems caused by the use of non-authorized datasets for training and all consequences thereof.The repository and its maintainer, svc develop team, have nothing to do with the consequences!\n","\n","1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments.\n","2. Any videos based on sovits that are published on video platforms must clearly indicate in the description that they are used for voice changing and specify the input source of the voice or audio, for example, using videos or audios published by others and separating the vocals as input source for conversion, which must provide clear original video or music links. If your own voice or other synthesized voices from other commercial vocal synthesis software are used as the input source for conversion, you must also explain it in the description.\n","3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.\n","4. 
Continuing to use this project is deemed as agreeing to the relevant provisions stated in this repository README. This repository README has the obligation to persuade, and is not responsible for any subsequent problems that may arise.\n","5. If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).\n","6. If you use this project for any other plan, please contact and inform the author of this repository in advance. Thank you very much.\n"],"metadata":{"id":"2q0l56aFQhAM"}},{"cell_type":"markdown","source":["## **Note:**\n","## **Make sure there is no a directory named `sovits4data` in your google drive at the first time you use this notebook.**\n","## **It will be created to store some necessary files.** \n","## **For sure you can change it to another directory by modifying `sovits_data_dir` variable.**"],"metadata":{"id":"M_RcDbVPhivj"}},{"cell_type":"markdown","source":["# **Initialize environment**"],"metadata":{"id":"fHaw6hGEa_Nk"}},{"cell_type":"code","source":["#@title Connect to colab runtime and check GPU\n","\n","#@markdown # Connect to colab runtime and check GPU\n","\n","#@markdown\n","\n","!nvidia-smi"],"metadata":{"id":"0gQcIZ8RsOkn"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Clone repository and install requirements\n","\n","#@markdown # Clone repository and install requirements\n","\n","#@markdown\n","\n","#@markdown ### After the execution is completed, the runtime will **automatically restart**\n","\n","#@markdown\n","\n","!git clone https://github.com/svc-develop-team/so-vits-svc -b 4.0\n","!pip uninstall torchdata torchtext\n","!pip install --upgrade pip setuptools numpy numba\n","!pip install pyworld praat-parselmouth fairseq tensorboardX torchcrepe librosa==0.9.1\n","!pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 
torchaudio==0.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117\n","%cd /content/so-vits-svc\n","!curl -L https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip -o /content/so-vits-svc/nsf_hifigan_20221211.zip\n","!unzip nsf_hifigan_20221211.zip\n","!rm -rf pretrain/nsf_hifigan\n","!mv -v nsf_hifigan pretrain\n","!curl -L https://ibm.ent.box.com/shared/static/z1wgl1stco8ffooyatzdwsqn2psd9lrr -o /content/so-vits-svc/hubert/checkpoint_best_legacy_500.pt\n","exit()"],"metadata":{"id":"0YUGpYrXhMck"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Mount google drive and select which directories to sync with google drive\n","\n","#@markdown # Mount google drive and select which directories to sync with google drive\n","\n","#@markdown\n","\n","from google.colab import drive\n","drive.mount(\"/content/drive\")\n","\n","#@markdown Directory to store **necessary files**, dont miss the slash at the end👇.\n","sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n","#@markdown By default it will create a `sovits4data/` folder in your google drive.\n","RAW_DIR = sovits_data_dir + \"raw/\"\n","RESULTS_DIR = sovits_data_dir + \"results/\"\n","FILELISTS_DIR = sovits_data_dir + \"filelists/\"\n","CONFIGS_DIR = sovits_data_dir + \"configs/\"\n","LOGS_DIR = sovits_data_dir + \"logs/44k/\"\n","\n","#@markdown\n","\n","#@markdown ### These folders will be synced with your google drvie\n","\n","#@markdown ### **Strongly recommend to check all.**\n","\n","#@markdown Sync **input audios** and **output audios**\n","sync_raw_and_results = True #@param {type:\"boolean\"}\n","if sync_raw_and_results:\n"," !mkdir -p {RAW_DIR}\n"," !mkdir -p {RESULTS_DIR}\n"," !rm -rf /content/so-vits-svc/raw\n"," !rm -rf /content/so-vits-svc/results\n"," !ln -s {RAW_DIR} /content/so-vits-svc/raw\n"," !ln -s {RESULTS_DIR} /content/so-vits-svc/results\n","\n","#@markdown Sync **config** and 
**models**\n","sync_configs_and_logs = True #@param {type:\"boolean\"}\n","if sync_configs_and_logs:\n"," !mkdir -p {FILELISTS_DIR}\n"," !mkdir -p {CONFIGS_DIR}\n"," !mkdir -p {LOGS_DIR}\n"," !rm -rf /content/so-vits-svc/filelists\n"," !rm -rf /content/so-vits-svc/configs\n"," !rm -rf /content/so-vits-svc/logs/44k\n"," !ln -s {FILELISTS_DIR} /content/so-vits-svc/filelists\n"," !ln -s {CONFIGS_DIR} /content/so-vits-svc/configs\n"," !ln -s {LOGS_DIR} /content/so-vits-svc/logs/44k"],"metadata":{"id":"wmUkpUmfn_Hs"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Get pretrained model(Optional but strongly recommend).\n","\n","#@markdown # Get pretrained model(Optional but strongly recommend).\n","\n","#@markdown\n","\n","#@markdown - Pre-trained model files: `G_0.pth` `D_0.pth`\n","#@markdown - Place them under /sovits4data/logs/44k/ in your google drive manualy\n","\n","#@markdown Get them from svc-develop-team(TBD) or anywhere else.\n","\n","#@markdown Although the pretrained model generally does not cause any copyright problems, please pay attention to it. 
For example, ask the author in advance, or the author has indicated the feasible use in the description clearly.\n","\n","!pwd"],"metadata":{"id":"G_PMPCN6wvgZ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# **Dataset preprocessing**"],"metadata":{"id":"k1qadJBFehMo"}},{"cell_type":"markdown","source":["Pack and upload your raw dataset(dataset_raw/) to your google drive.\n","\n","Makesure the file structure in your zip file looks like this:\n","\n","```\n","YourZIPforSingleSpeakers.zip\n","└───speaker\n"," ├───xxx1-xxx1.wav\n"," ├───...\n"," └───Lxx-0xx8.wav\n","```\n","\n","```\n","YourZIPforMultipleSpeakers.zip\n","├───speaker0\n","│ ├───xxx1-xxx1.wav\n","│ ├───...\n","│ └───Lxx-0xx8.wav\n","└───speaker1\n"," ├───xx2-0xxx2.wav\n"," ├───...\n"," └───xxx7-xxx007.wav\n","```\n","\n","**Even if there is only one speaker, a folder named `{speaker_name}` is needed.**\n","\n","![1.png]()\n","\n","![2.png]()"],"metadata":{"id":"kBlju6Q3lSM6"}},{"cell_type":"code","source":["#@title Get raw dataset from google drive\n","\n","#@markdown # Get raw dataset from google drive\n","\n","#@markdown\n","\n","#@markdown Directory where **your zip file** located in, dont miss the slash at the end👇.\n","sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n","#@markdown Filename of **your zip file**, do NOT be \"dataset.zip\"\n","zip_filename = \"YourZIPFilenameofRawDataset.zip\" #@param {type:\"string\"}\n","ZIP_PATH = sovits_data_dir + zip_filename\n","\n","!unzip -od /content/so-vits-svc/dataset_raw {ZIP_PATH}"],"metadata":{"id":"U05CXlAipvJR"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Resample to 44100Hz and mono\n","\n","#@markdown # Resample to 44100Hz and mono\n","\n","#@markdown\n","\n","%cd /content/so-vits-svc\n","!python resample.py"],"metadata":{"id":"_ThKTzYs5CfL"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Divide filelists and generate 
config.json\n","\n","#@markdown # Divide filelists and generate config.json\n","\n","#@markdown\n","\n","%cd /content/so-vits-svc\n","!python preprocess_flist_config.py"],"metadata":{"id":"svITReeL5N8K"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Generate hubert and f0\n","\n","#@markdown # Generate hubert and f0\n","\n","#@markdown\n","\n","%cd /content/so-vits-svc\n","!python preprocess_hubert_f0.py"],"metadata":{"id":"xHUXMi836DMe"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Save the preprocessed dataset to google drive\n","\n","#@markdown # Save the preprocessed dataset to google drive\n","\n","#@markdown\n","\n","#@markdown You can save the dataset and related files to your google drive for the next training\n","\n","#@markdown **Directory for saving**, dont miss the slash at the end👇.\n","sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n","\n","#@markdown There will be a `dataset.zip` contained `dataset/` in your google drive, which is preprocessed data.\n","\n","!mkdir -p {sovits_data_dir}\n","!zip -r dataset.zip /content/so-vits-svc/dataset\n","!cp -vr dataset.zip \"{sovits_data_dir}\""],"metadata":{"id":"Wo4OTmTAUXgj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Unzip preprocessed dataset from google drive directly if you have preprocessed already.\n","\n","#@markdown # Unzip preprocessed dataset from google drive directly if you have preprocessed already.\n","\n","#@markdown\n","\n","#@markdown Directory where **your preprocessed dataset** located in, dont miss the slash at the end👇.\n","sovits_data_dir = \"/content/drive/MyDrive/sovits4data/\" #@param {type:\"string\"}\n","CONFIG = sovits_data_dir + \"configs/\"\n","FILELISTS = sovits_data_dir + \"filelists/\"\n","DATASET = sovits_data_dir + \"dataset.zip\"\n","\n","!cp -vr {CONFIG} /content/so-vits-svc/\n","!cp -vr {FILELISTS} /content/so-vits-svc/\n","!unzip {DATASET} -d 
/"],"metadata":{"id":"P2G6v_6zblWK"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# **Trainning**"],"metadata":{"id":"ENoH-pShel7w"}},{"cell_type":"code","source":["#@title Start training\n","\n","#@markdown # Start training\n","\n","#@markdown If you want to use pre-trained models, upload them to /sovits4data/logs/44k/ in your google drive manualy.\n","\n","#@markdown\n","\n","#@markdown Whether to enable tensorboard\n","tensorboard_on = True #@param {type:\"boolean\"}\n","\n","if tensorboard_on:\n"," %load_ext tensorboard\n"," %tensorboard --logdir logs/44k\n","\n","%cd /content/so-vits-svc\n","!python train.py -c configs/config.json -m 44k"],"metadata":{"id":"-hEFFTCfZf57"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["#@title Train cluster model (Optional)\n","\n","#@markdown # Train cluster model (Optional)\n","\n","#@markdown #### Details see [README.md#cluster-based-timbre-leakage-control](https://github.com/svc-develop-team/so-vits-svc#cluster-based-timbre-leakage-control)\n","\n","#@markdown\n","\n","%cd /content/so-vits-svc\n","!python cluster/train_cluster.py"],"metadata":{"id":"ZThaMxmIJgWy"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# **Inference**\n","### Upload wav files from this notebook\n","### **OR**\n","### Upload to `sovits4data/raw/` in your google drive manualy (should be faster)"],"metadata":{"id":"oCnbX-OT897k"}},{"cell_type":"code","source":["#@title Upload wav files, the filename should not contain any special symbols like `#` `$` `(` `)`\n","\n","#@markdown # Upload wav files, the filename should not contain any special symbols like `#` `$` `(` `)`\n","\n","#@markdown\n","\n","%cd /content/so-vits-svc\n","%run wav_upload.py --type 
audio"],"metadata":{"id":"XUsmGkgCMD_Q","colab":{"base_uri":"https://localhost:8080/","height":75},"executionInfo":{"status":"ok","timestamp":1678591088790,"user_tz":-480,"elapsed":94633,"user":{"displayName":"謬紗特","userId":"09445825975794260265"}},"outputId":"8bbfde13-030a-4ba0-bbdb-7eb6b89c02b4"},"execution_count":null,"outputs":[{"output_type":"display_data","data":{"text/plain":[""],"text/html":["\n"," \n"," \n"," Upload widget is only available when the cell has been executed in the\n"," current browser session. Please rerun this cell to enable.\n"," \n"," "]},"metadata":{}},{"output_type":"stream","name":"stdout","text":["Saving YourWAVFile.wav to YourWAVFile.wav\n"]}]},{"cell_type":"code","source":["#@title Start inference (and download)\n","\n","#@markdown # Start inference (and download)\n","\n","#@markdown Parameters see [README.MD#Inference](https://github.com/svc-develop-team/so-vits-svc#-inference)\n","\n","#@markdown\n","\n","wav_filename = \"YourWAVFile.wav\" #@param {type:\"string\"}\n","model_filename = \"G_210000.pth\" #@param {type:\"string\"}\n","model_path = \"/content/so-vits-svc/logs/44k/\" + model_filename\n","speaker = \"YourSpeaker\" #@param {type:\"string\"}\n","trans = \"0\" #@param {type:\"string\"}\n","cluster_infer_ratio = \"0\" #@param {type:\"string\"}\n","f0_mean_pooling = False #@param {type:\"boolean\"}\n","fmp = \"\"\n","if f0_mean_pooling:\n"," fmp = \" -fmp \"\n","auto_predict_f0 = False #@param {type:\"boolean\"}\n","apf = \"\"\n","if auto_predict_f0:\n"," apf = \" -a \"\n","#@markdown\n","\n","#@markdown Generally keep default:\n","config_filename = \"config.json\" #@param {type:\"string\"}\n","config_path = \"/content/so-vits-svc/configs/\" + config_filename\n","kmeans_filenname = \"kmeans_10000.pt\" #@param {type:\"string\"}\n","kmeans_path = \"/content/so-vits-svc/logs/44k/\" + kmeans_filenname\n","slice_db = \"-40\" #@param {type:\"string\"}\n","wav_format = \"flac\" #@param {type:\"string\"}\n","wav_output = 
\"/content/so-vits-svc/results/\" + wav_filename + \"_\" + trans + \"key\" + \"_\" + speaker + \".\" + wav_format\n","\n","%cd /content/so-vits-svc\n","!python inference_main.py -n {wav_filename} -m {model_path} -s {speaker} -t {trans} -cr {cluster_infer_ratio} -c {config_path} -cm {kmeans_path} -sd {slice_db} -wf {wav_format} {fmp} {apf}\n","\n","#@markdown\n","\n","#@markdown If you dont want to download from here, uncheck this.\n","download_after_inference = True #@param {type:\"boolean\"}\n","\n","if download_after_inference:\n"," from google.colab import files\n"," files.download(wav_output)"],"metadata":{"id":"dYnKuKTIj3z1"},"execution_count":null,"outputs":[]}]}
\ No newline at end of file
diff --git a/so-vits-svc/train.py b/so-vits-svc/train.py
new file mode 100644
index 0000000000000000000000000000000000000000..410f19213866f388763f0c9ac21c24c09dd5dfea
--- /dev/null
+++ b/so-vits-svc/train.py
@@ -0,0 +1,330 @@
+import logging
+import multiprocessing
+import time
+
+logging.getLogger('matplotlib').setLevel(logging.WARNING)
+logging.getLogger('numba').setLevel(logging.WARNING)
+
+import os
+import json
+import argparse
+import itertools
+import math
+import torch
+from torch import nn, optim
+from torch.nn import functional as F
+from torch.utils.data import DataLoader
+from torch.utils.tensorboard import SummaryWriter
+import torch.multiprocessing as mp
+import torch.distributed as dist
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.cuda.amp import autocast, GradScaler
+
+import modules.commons as commons
+import utils
+from data_utils import TextAudioSpeakerLoader, TextAudioCollate
+from models import (
+ SynthesizerTrn,
+ MultiPeriodDiscriminator,
+)
+from modules.losses import (
+ kl_loss,
+ generator_loss, discriminator_loss, feature_loss
+)
+
+from modules.mel_processing import mel_spectrogram_torch, spec_to_mel_torch
+
+torch.backends.cudnn.benchmark = True
+global_step = 0
+start_time = time.time()
+
+# os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'INFO'
+
+
+def main():
+    """Assumes single-node, multi-GPU training only."""
+    assert torch.cuda.is_available(), "CPU training is not allowed."
+    hps = utils.get_hparams()
+
+    n_gpus = torch.cuda.device_count()
+    os.environ['MASTER_ADDR'] = 'localhost'
+    os.environ['MASTER_PORT'] = hps.train.port
+
+    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
+
+
+def run(rank, n_gpus, hps):
+    global global_step
+    if rank == 0:
+        logger = utils.get_logger(hps.model_dir)
+        logger.info(hps)
+        utils.check_git_hash(hps.model_dir)
+        writer = SummaryWriter(log_dir=hps.model_dir)
+        writer_eval = SummaryWriter(log_dir=os.path.join(hps.model_dir, "eval"))
+
+    # NCCL is not available for PyTorch on Windows, so use the gloo backend there
+    dist.init_process_group(backend='gloo' if os.name == 'nt' else 'nccl', init_method='env://', world_size=n_gpus, rank=rank)
+    torch.manual_seed(hps.train.seed)
+    torch.cuda.set_device(rank)
+    collate_fn = TextAudioCollate()
+    all_in_mem = hps.train.all_in_mem  # If you have enough memory, turn on this option to avoid disk IO and speed up training.
+    train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps, all_in_mem=all_in_mem)
+    num_workers = 5 if multiprocessing.cpu_count() > 4 else multiprocessing.cpu_count()
+    if all_in_mem:
+        num_workers = 0
+    train_loader = DataLoader(train_dataset, num_workers=num_workers, shuffle=False, pin_memory=True,
+                              batch_size=hps.train.batch_size, collate_fn=collate_fn)
+    if rank == 0:
+        eval_dataset = TextAudioSpeakerLoader(hps.data.validation_files, hps, all_in_mem=all_in_mem)
+        eval_loader = DataLoader(eval_dataset, num_workers=1, shuffle=False,
+                                 batch_size=1, pin_memory=False,
+                                 drop_last=False, collate_fn=collate_fn)
+
+    net_g = SynthesizerTrn(
+        hps.data.filter_length // 2 + 1,
+        hps.train.segment_size // hps.data.hop_length,
+        **hps.model).cuda(rank)
+    net_d = MultiPeriodDiscriminator(hps.model.use_spectral_norm).cuda(rank)
+    optim_g = torch.optim.AdamW(
+        net_g.parameters(),
+        hps.train.learning_rate,
+        betas=hps.train.betas,
+        eps=hps.train.eps)
+    optim_d = torch.optim.AdamW(
+        net_d.parameters(),
+        hps.train.learning_rate,
+        betas=hps.train.betas,
+        eps=hps.train.eps)
+    net_g = DDP(net_g, device_ids=[rank])  # , find_unused_parameters=True)
+    net_d = DDP(net_d, device_ids=[rank])
+
+    skip_optimizer = False
+    try:
+        _, _, _, epoch_str = utils.load_checkpoint(utils.latest_checkpoint_path(hps.model_dir, "G_*.pth"), net_g,
+                                                   optim_g, skip_optimizer)
+        _, _, _, epoch_str = utils.load_checkpoint(utils.latest_checkpoint_path(hps.model_dir, "D_*.pth"), net_d,
+                                                   optim_d, skip_optimizer)
+        epoch_str = max(epoch_str, 1)
+        name=utils.latest_checkpoint_path(hps.model_dir, "D_*.pth")
+        global_step=int(name[name.rfind("_")+1:name.rfind(".")])+1
+        #global_step = (epoch_str - 1) * len(train_loader)
+    except Exception:
+        print("load old checkpoint failed...")
+        epoch_str = 1
+        global_step = 0
+    if skip_optimizer:
+        epoch_str = 1
+        global_step = 0
+
+    warmup_epoch = hps.train.warmup_epochs
+    scheduler_g = torch.optim.lr_scheduler.ExponentialLR(optim_g, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)
+    scheduler_d = torch.optim.lr_scheduler.ExponentialLR(optim_d, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)
+
+    scaler = GradScaler(enabled=hps.train.fp16_run)
+
+    for epoch in range(epoch_str, hps.train.epochs + 1):
+        # update learning rate
+        if epoch > 1:
+            scheduler_g.step()
+            scheduler_d.step()
+        # set up warm-up learning rate
+        if epoch <= warmup_epoch:
+            for param_group in optim_g.param_groups:
+                param_group['lr'] = hps.train.learning_rate / warmup_epoch * epoch
+            for param_group in optim_d.param_groups:
+                param_group['lr'] = hps.train.learning_rate / warmup_epoch * epoch
+        # training
+        if rank == 0:
+            train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
+                               [train_loader, eval_loader], logger, [writer, writer_eval])
+        else:
+            train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
+                               [train_loader, None], None, None)
+
+
+def train_and_evaluate(rank, epoch, hps, nets, optims, schedulers, scaler, loaders, logger, writers):
+ net_g, net_d = nets
+ optim_g, optim_d = optims
+ scheduler_g, scheduler_d = schedulers
+ train_loader, eval_loader = loaders
+ if writers is not None:
+ writer, writer_eval = writers
+
+ # train_loader.batch_sampler.set_epoch(epoch)
+ global global_step
+
+ net_g.train()
+ net_d.train()
+ for batch_idx, items in enumerate(train_loader):
+ c, f0, spec, y, spk, lengths, uv = items
+ g = spk.cuda(rank, non_blocking=True)
+ spec, y = spec.cuda(rank, non_blocking=True), y.cuda(rank, non_blocking=True)
+ c = c.cuda(rank, non_blocking=True)
+ f0 = f0.cuda(rank, non_blocking=True)
+ uv = uv.cuda(rank, non_blocking=True)
+ lengths = lengths.cuda(rank, non_blocking=True)
+ mel = spec_to_mel_torch(
+ spec,
+ hps.data.filter_length,
+ hps.data.n_mel_channels,
+ hps.data.sampling_rate,
+ hps.data.mel_fmin,
+ hps.data.mel_fmax)
+
+ with autocast(enabled=hps.train.fp16_run):
+ y_hat, ids_slice, z_mask, \
+ (z, z_p, m_p, logs_p, m_q, logs_q), pred_lf0, norm_lf0, lf0 = net_g(c, f0, uv, spec, g=g, c_lengths=lengths,
+ spec_lengths=lengths)
+
+ y_mel = commons.slice_segments(mel, ids_slice, hps.train.segment_size // hps.data.hop_length)
+ y_hat_mel = mel_spectrogram_torch(
+ y_hat.squeeze(1),
+ hps.data.filter_length,
+ hps.data.n_mel_channels,
+ hps.data.sampling_rate,
+ hps.data.hop_length,
+ hps.data.win_length,
+ hps.data.mel_fmin,
+ hps.data.mel_fmax
+ )
+ y = commons.slice_segments(y, ids_slice * hps.data.hop_length, hps.train.segment_size) # slice
+
+ # Discriminator
+ y_d_hat_r, y_d_hat_g, _, _ = net_d(y, y_hat.detach())
+
+ with autocast(enabled=False):
+ loss_disc, losses_disc_r, losses_disc_g = discriminator_loss(y_d_hat_r, y_d_hat_g)
+ loss_disc_all = loss_disc
+
+ optim_d.zero_grad()
+ scaler.scale(loss_disc_all).backward()
+ scaler.unscale_(optim_d)
+ grad_norm_d = commons.clip_grad_value_(net_d.parameters(), None)
+ scaler.step(optim_d)
+
+ with autocast(enabled=hps.train.fp16_run):
+ # Generator
+ y_d_hat_r, y_d_hat_g, fmap_r, fmap_g = net_d(y, y_hat)
+ with autocast(enabled=False):
+ loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
+ loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl
+ loss_fm = feature_loss(fmap_r, fmap_g)
+ loss_gen, losses_gen = generator_loss(y_d_hat_g)
+ loss_lf0 = F.mse_loss(pred_lf0, lf0)
+ loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl + loss_lf0
+ optim_g.zero_grad()
+ scaler.scale(loss_gen_all).backward()
+ scaler.unscale_(optim_g)
+ grad_norm_g = commons.clip_grad_value_(net_g.parameters(), None)
+ scaler.step(optim_g)
+ scaler.update()
+
+ if rank == 0:
+ if global_step % hps.train.log_interval == 0:
+ lr = optim_g.param_groups[0]['lr']
+ losses = [loss_disc, loss_gen, loss_fm, loss_mel, loss_kl]
+ reference_loss=0
+ for i in losses:
+ reference_loss += i
+ logger.info('Train Epoch: {} [{:.0f}%]'.format(
+ epoch,
+ 100. * batch_idx / len(train_loader)))
+ logger.info(f"Losses: {[x.item() for x in losses]}, step: {global_step}, lr: {lr}, reference_loss: {reference_loss}")
+
+ scalar_dict = {"loss/g/total": loss_gen_all, "loss/d/total": loss_disc_all, "learning_rate": lr,
+ "grad_norm_d": grad_norm_d, "grad_norm_g": grad_norm_g}
+ scalar_dict.update({"loss/g/fm": loss_fm, "loss/g/mel": loss_mel, "loss/g/kl": loss_kl,
+ "loss/g/lf0": loss_lf0})
+
+ # scalar_dict.update({"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)})
+ # scalar_dict.update({"loss/d_r/{}".format(i): v for i, v in enumerate(losses_disc_r)})
+ # scalar_dict.update({"loss/d_g/{}".format(i): v for i, v in enumerate(losses_disc_g)})
+ image_dict = {
+ "slice/mel_org": utils.plot_spectrogram_to_numpy(y_mel[0].data.cpu().numpy()),
+ "slice/mel_gen": utils.plot_spectrogram_to_numpy(y_hat_mel[0].data.cpu().numpy()),
+ "all/mel": utils.plot_spectrogram_to_numpy(mel[0].data.cpu().numpy()),
+ "all/lf0": utils.plot_data_to_numpy(lf0[0, 0, :].cpu().numpy(),
+ pred_lf0[0, 0, :].detach().cpu().numpy()),
+ "all/norm_lf0": utils.plot_data_to_numpy(lf0[0, 0, :].cpu().numpy(),
+ norm_lf0[0, 0, :].detach().cpu().numpy())
+ }
+
+ utils.summarize(
+ writer=writer,
+ global_step=global_step,
+ images=image_dict,
+ scalars=scalar_dict
+ )
+
+ if global_step % hps.train.eval_interval == 0:
+ evaluate(hps, net_g, eval_loader, writer_eval)
+ utils.save_checkpoint(net_g, optim_g, hps.train.learning_rate, epoch,
+ os.path.join(hps.model_dir, "G_{}.pth".format(global_step)))
+ utils.save_checkpoint(net_d, optim_d, hps.train.learning_rate, epoch,
+ os.path.join(hps.model_dir, "D_{}.pth".format(global_step)))
+ keep_ckpts = getattr(hps.train, 'keep_ckpts', 0)
+ if keep_ckpts > 0:
+ utils.clean_checkpoints(path_to_models=hps.model_dir, n_ckpts_to_keep=keep_ckpts, sort_by_time=True)
+
+ global_step += 1
+
+ if rank == 0:
+ global start_time
+ now = time.time()
+ duration = format(now - start_time, '.2f')
+ logger.info(f'====> Epoch: {epoch}, cost {duration} s')
+ start_time = now
+
+
+def evaluate(hps, generator, eval_loader, writer_eval):
+ generator.eval()
+ image_dict = {}
+ audio_dict = {}
+ with torch.no_grad():
+ for batch_idx, items in enumerate(eval_loader):
+ c, f0, spec, y, spk, _, uv = items
+ g = spk[:1].cuda(0)
+ spec, y = spec[:1].cuda(0), y[:1].cuda(0)
+ c = c[:1].cuda(0)
+ f0 = f0[:1].cuda(0)
+ uv= uv[:1].cuda(0)
+ mel = spec_to_mel_torch(
+ spec,
+ hps.data.filter_length,
+ hps.data.n_mel_channels,
+ hps.data.sampling_rate,
+ hps.data.mel_fmin,
+ hps.data.mel_fmax)
+ y_hat = generator.module.infer(c, f0, uv, g=g)
+
+ y_hat_mel = mel_spectrogram_torch(
+ y_hat.squeeze(1).float(),
+ hps.data.filter_length,
+ hps.data.n_mel_channels,
+ hps.data.sampling_rate,
+ hps.data.hop_length,
+ hps.data.win_length,
+ hps.data.mel_fmin,
+ hps.data.mel_fmax
+ )
+
+ audio_dict.update({
+ f"gen/audio_{batch_idx}": y_hat[0],
+ f"gt/audio_{batch_idx}": y[0]
+ })
+ image_dict.update({
+ "gen/mel": utils.plot_spectrogram_to_numpy(y_hat_mel[0].cpu().numpy()),
+ "gt/mel": utils.plot_spectrogram_to_numpy(mel[0].cpu().numpy())
+ })
+ utils.summarize(
+ writer=writer_eval,
+ global_step=global_step,
+ images=image_dict,
+ audios=audio_dict,
+ audio_sampling_rate=hps.data.sampling_rate
+ )
+ generator.train()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/so-vits-svc/utils.py b/so-vits-svc/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..326a6ef8c231dc5fe6b90c3efc44c86247a5f2d1
--- /dev/null
+++ b/so-vits-svc/utils.py
@@ -0,0 +1,543 @@
+import os
+import glob
+import re
+import sys
+import argparse
+import logging
+import json
+import subprocess
+import warnings
+import random
+import functools
+
+import librosa
+import numpy as np
+from scipy.io.wavfile import read
+import torch
+from torch.nn import functional as F
+from modules.commons import sequence_mask
+from hubert import hubert_model
+
+MATPLOTLIB_FLAG = False
+
+logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
+logger = logging
+
+f0_bin = 256
+f0_max = 1100.0
+f0_min = 50.0
+f0_mel_min = 1127 * np.log(1 + f0_min / 700)
+f0_mel_max = 1127 * np.log(1 + f0_max / 700)
+
+
+# def normalize_f0(f0, random_scale=True):
+# f0_norm = f0.clone() # create a copy of the input Tensor
+# batch_size, _, frame_length = f0_norm.shape
+# for i in range(batch_size):
+# means = torch.mean(f0_norm[i, 0, :])
+# if random_scale:
+# factor = random.uniform(0.8, 1.2)
+# else:
+# factor = 1
+# f0_norm[i, 0, :] = (f0_norm[i, 0, :] - means) * factor
+# return f0_norm
+# def normalize_f0(f0, random_scale=True):
+# means = torch.mean(f0[:, 0, :], dim=1, keepdim=True)
+# if random_scale:
+# factor = torch.Tensor(f0.shape[0],1).uniform_(0.8, 1.2).to(f0.device)
+# else:
+# factor = torch.ones(f0.shape[0], 1, 1).to(f0.device)
+# f0_norm = (f0 - means.unsqueeze(-1)) * factor.unsqueeze(-1)
+# return f0_norm
+
+def deprecated(func):
+ """This is a decorator which can be used to mark functions
+ as deprecated. It will result in a warning being emitted
+ when the function is used."""
+ @functools.wraps(func)
+ def new_func(*args, **kwargs):
+ warnings.simplefilter('always', DeprecationWarning) # turn off filter
+ warnings.warn("Call to deprecated function {}.".format(func.__name__),
+ category=DeprecationWarning,
+ stacklevel=2)
+ warnings.simplefilter('default', DeprecationWarning) # reset filter
+ return func(*args, **kwargs)
+ return new_func
+
+def normalize_f0(f0, x_mask, uv, random_scale=True):
+ # calculate the per-item mean f0 over the frames selected by uv
+ uv_sum = torch.sum(uv, dim=1, keepdim=True)
+ uv_sum[uv_sum == 0] = 9999
+ means = torch.sum(f0[:, 0, :] * uv, dim=1, keepdim=True) / uv_sum
+
+ if random_scale:
+ factor = torch.Tensor(f0.shape[0], 1).uniform_(0.8, 1.2).to(f0.device)
+ else:
+ factor = torch.ones(f0.shape[0], 1).to(f0.device)
+ # normalize f0 based on means and factor
+ f0_norm = (f0 - means.unsqueeze(-1)) * factor.unsqueeze(-1)
+ if torch.isnan(f0_norm).any():
+ raise ValueError("normalize_f0 produced NaN values")
+ return f0_norm * x_mask
+
+def compute_f0_uv_torchcrepe(wav_numpy, p_len=None, sampling_rate=44100, hop_length=512,device=None,cr_threshold=0.05):
+ from modules.crepe import CrepePitchExtractor
+ x = wav_numpy
+ if p_len is None:
+ p_len = x.shape[0]//hop_length
+ else:
+ assert abs(p_len-x.shape[0]//hop_length) < 4, "pad length error"
+
+ f0_min = 50
+ f0_max = 1100
+ F0Creper = CrepePitchExtractor(hop_length=hop_length,f0_min=f0_min,f0_max=f0_max,device=device,threshold=cr_threshold)
+ f0,uv = F0Creper(x[None,:].float(),sampling_rate,pad_to=p_len)
+ return f0,uv
+
+def plot_data_to_numpy(x, y):
+ global MATPLOTLIB_FLAG
+ if not MATPLOTLIB_FLAG:
+ import matplotlib
+ matplotlib.use("Agg")
+ MATPLOTLIB_FLAG = True
+ mpl_logger = logging.getLogger('matplotlib')
+ mpl_logger.setLevel(logging.WARNING)
+ import matplotlib.pylab as plt
+ import numpy as np
+
+ fig, ax = plt.subplots(figsize=(10, 2))
+ plt.plot(x)
+ plt.plot(y)
+ plt.tight_layout()
+
+ fig.canvas.draw()
+ data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
+ data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
+ plt.close()
+ return data
+
+
+
+def interpolate_f0(f0):
+
+ data = np.reshape(f0, (f0.size, 1))
+
+ vuv_vector = np.zeros((data.size, 1), dtype=np.float32)
+ vuv_vector[data > 0.0] = 1.0
+ vuv_vector[data <= 0.0] = 0.0
+
+ ip_data = data
+
+ frame_number = data.size
+ last_value = 0.0
+ for i in range(frame_number):
+ if data[i] <= 0.0:
+ j = i + 1
+ for j in range(i + 1, frame_number):
+ if data[j] > 0.0:
+ break
+ if j < frame_number - 1:
+ if last_value > 0.0:
+ step = (data[j] - data[i - 1]) / float(j - i)
+ for k in range(i, j):
+ ip_data[k] = data[i - 1] + step * (k - i + 1)
+ else:
+ for k in range(i, j):
+ ip_data[k] = data[j]
+ else:
+ for k in range(i, frame_number):
+ ip_data[k] = last_value
+ else:
+ ip_data[i] = data[i] # this may not be necessary
+ last_value = data[i]
+
+ return ip_data[:,0], vuv_vector[:,0]
+
+
+def compute_f0_parselmouth(wav_numpy, p_len=None, sampling_rate=44100, hop_length=512):
+ import parselmouth
+ x = wav_numpy
+ if p_len is None:
+ p_len = x.shape[0]//hop_length
+ else:
+ assert abs(p_len-x.shape[0]//hop_length) < 4, "pad length error"
+ time_step = hop_length / sampling_rate * 1000
+ f0_min = 50
+ f0_max = 1100
+ f0 = parselmouth.Sound(x, sampling_rate).to_pitch_ac(
+ time_step=time_step / 1000, voicing_threshold=0.6,
+ pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency']
+
+ pad_size=(p_len - len(f0) + 1) // 2
+ if(pad_size>0 or p_len - len(f0) - pad_size>0):
+ f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant')
+ return f0
+
+def resize_f0(x, target_len):
+ source = np.array(x)
+ source[source<0.001] = np.nan
+ target = np.interp(np.arange(0, len(source)*target_len, len(source))/ target_len, np.arange(0, len(source)), source)
+ res = np.nan_to_num(target)
+ return res
+
+def compute_f0_dio(wav_numpy, p_len=None, sampling_rate=44100, hop_length=512):
+ import pyworld
+ if p_len is None:
+ p_len = wav_numpy.shape[0]//hop_length
+ f0, t = pyworld.dio(
+ wav_numpy.astype(np.double),
+ fs=sampling_rate,
+ f0_ceil=800,
+ frame_period=1000 * hop_length / sampling_rate,
+ )
+ f0 = pyworld.stonemask(wav_numpy.astype(np.double), f0, t, sampling_rate)
+ for index, pitch in enumerate(f0):
+ f0[index] = round(pitch, 1)
+ return resize_f0(f0, p_len)
+
+def f0_to_coarse(f0):
+ is_torch = isinstance(f0, torch.Tensor)
+ f0_mel = 1127 * (1 + f0 / 700).log() if is_torch else 1127 * np.log(1 + f0 / 700)
+ f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * (f0_bin - 2) / (f0_mel_max - f0_mel_min) + 1
+
+ f0_mel[f0_mel <= 1] = 1
+ f0_mel[f0_mel > f0_bin - 1] = f0_bin - 1
+ f0_coarse = (f0_mel + 0.5).int() if is_torch else np.rint(f0_mel).astype(int)
+ assert f0_coarse.max() <= 255 and f0_coarse.min() >= 1, (f0_coarse.max(), f0_coarse.min())
+ return f0_coarse
+
+
+def get_hubert_model():
+ vec_path = "hubert/checkpoint_best_legacy_500.pt"
+ print("load model(s) from {}".format(vec_path))
+ from fairseq import checkpoint_utils
+ models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
+ [vec_path],
+ suffix="",
+ )
+ model = models[0]
+ model.eval()
+ return model
+
+def get_hubert_content(hmodel, wav_16k_tensor):
+ feats = wav_16k_tensor
+ if feats.dim() == 2: # double channels
+ feats = feats.mean(-1)
+ assert feats.dim() == 1, feats.dim()
+ feats = feats.view(1, -1)
+ padding_mask = torch.BoolTensor(feats.shape).fill_(False)
+ inputs = {
+ "source": feats.to(wav_16k_tensor.device),
+ "padding_mask": padding_mask.to(wav_16k_tensor.device),
+ "output_layer": 9, # layer 9
+ }
+ with torch.no_grad():
+ logits = hmodel.extract_features(**inputs)
+ feats = hmodel.final_proj(logits[0])
+ return feats.transpose(1, 2)
+
+
+def get_content(cmodel, y):
+ with torch.no_grad():
+ c = cmodel.extract_features(y.squeeze(1))[0]
+ c = c.transpose(1, 2)
+ return c
+
+
+
+def load_checkpoint(checkpoint_path, model, optimizer=None, skip_optimizer=False):
+ assert os.path.isfile(checkpoint_path)
+ checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
+ iteration = checkpoint_dict['iteration']
+ learning_rate = checkpoint_dict['learning_rate']
+ if optimizer is not None and not skip_optimizer and checkpoint_dict['optimizer'] is not None:
+ optimizer.load_state_dict(checkpoint_dict['optimizer'])
+ saved_state_dict = checkpoint_dict['model']
+ if hasattr(model, 'module'):
+ state_dict = model.module.state_dict()
+ else:
+ state_dict = model.state_dict()
+ new_state_dict = {}
+ for k, v in state_dict.items():
+ try:
+ # assert "dec" in k or "disc" in k
+ # print("load", k)
+ new_state_dict[k] = saved_state_dict[k]
+ assert saved_state_dict[k].shape == v.shape, (saved_state_dict[k].shape, v.shape)
+ except Exception:
+ print("error, %s is not in the checkpoint" % k)
+ logger.info("%s is not in the checkpoint" % k)
+ new_state_dict[k] = v
+ if hasattr(model, 'module'):
+ model.module.load_state_dict(new_state_dict)
+ else:
+ model.load_state_dict(new_state_dict)
+ print("load ")
+ logger.info("Loaded checkpoint '{}' (iteration {})".format(
+ checkpoint_path, iteration))
+ return model, optimizer, learning_rate, iteration
+
+
+def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path):
+ logger.info("Saving model and optimizer state at iteration {} to {}".format(
+ iteration, checkpoint_path))
+ if hasattr(model, 'module'):
+ state_dict = model.module.state_dict()
+ else:
+ state_dict = model.state_dict()
+ torch.save({'model': state_dict,
+ 'iteration': iteration,
+ 'optimizer': optimizer.state_dict(),
+ 'learning_rate': learning_rate}, checkpoint_path)
+
+def clean_checkpoints(path_to_models='logs/44k/', n_ckpts_to_keep=2, sort_by_time=True):
+ """Freeing up space by deleting saved ckpts
+
+ Arguments:
+ path_to_models -- Path to the model directory
+ n_ckpts_to_keep -- Number of ckpts to keep, excluding G_0.pth and D_0.pth
+ sort_by_time -- True -> chronologically delete ckpts
+ False -> lexicographically delete ckpts
+ """
+ ckpts_files = [f for f in os.listdir(path_to_models) if os.path.isfile(os.path.join(path_to_models, f))]
+ name_key = (lambda _f: int(re.compile(r'._(\d+)\.pth').match(_f).group(1)))
+ time_key = (lambda _f: os.path.getmtime(os.path.join(path_to_models, _f)))
+ sort_key = time_key if sort_by_time else name_key
+ x_sorted = lambda _x: sorted([f for f in ckpts_files if f.startswith(_x) and not f.endswith('_0.pth')], key=sort_key)
+ to_del = [os.path.join(path_to_models, fn) for fn in
+ (x_sorted('G')[:-n_ckpts_to_keep] + x_sorted('D')[:-n_ckpts_to_keep])]
+ del_info = lambda fn: logger.info(f".. Free up space by deleting ckpt {fn}")
+ del_routine = lambda x: [os.remove(x), del_info(x)]
+ rs = [del_routine(fn) for fn in to_del]
+
+def summarize(writer, global_step, scalars={}, histograms={}, images={}, audios={}, audio_sampling_rate=22050):
+ for k, v in scalars.items():
+ writer.add_scalar(k, v, global_step)
+ for k, v in histograms.items():
+ writer.add_histogram(k, v, global_step)
+ for k, v in images.items():
+ writer.add_image(k, v, global_step, dataformats='HWC')
+ for k, v in audios.items():
+ writer.add_audio(k, v, global_step, audio_sampling_rate)
+
+
+def latest_checkpoint_path(dir_path, regex="G_*.pth"):
+ f_list = glob.glob(os.path.join(dir_path, regex))
+ f_list.sort(key=lambda f: int("".join(filter(str.isdigit, f))))
+ x = f_list[-1]
+ print(x)
+ return x
+
+
+def plot_spectrogram_to_numpy(spectrogram):
+ global MATPLOTLIB_FLAG
+ if not MATPLOTLIB_FLAG:
+ import matplotlib
+ matplotlib.use("Agg")
+ MATPLOTLIB_FLAG = True
+ mpl_logger = logging.getLogger('matplotlib')
+ mpl_logger.setLevel(logging.WARNING)
+ import matplotlib.pylab as plt
+ import numpy as np
+
+ fig, ax = plt.subplots(figsize=(10,2))
+ im = ax.imshow(spectrogram, aspect="auto", origin="lower",
+ interpolation='none')
+ plt.colorbar(im, ax=ax)
+ plt.xlabel("Frames")
+ plt.ylabel("Channels")
+ plt.tight_layout()
+
+ fig.canvas.draw()
+ data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
+ data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
+ plt.close()
+ return data
+
+
+def plot_alignment_to_numpy(alignment, info=None):
+ global MATPLOTLIB_FLAG
+ if not MATPLOTLIB_FLAG:
+ import matplotlib
+ matplotlib.use("Agg")
+ MATPLOTLIB_FLAG = True
+ mpl_logger = logging.getLogger('matplotlib')
+ mpl_logger.setLevel(logging.WARNING)
+ import matplotlib.pylab as plt
+ import numpy as np
+
+ fig, ax = plt.subplots(figsize=(6, 4))
+ im = ax.imshow(alignment.transpose(), aspect='auto', origin='lower',
+ interpolation='none')
+ fig.colorbar(im, ax=ax)
+ xlabel = 'Decoder timestep'
+ if info is not None:
+ xlabel += '\n\n' + info
+ plt.xlabel(xlabel)
+ plt.ylabel('Encoder timestep')
+ plt.tight_layout()
+
+ fig.canvas.draw()
+ data = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
+ data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
+ plt.close()
+ return data
+
+
+def load_wav_to_torch(full_path):
+ sampling_rate, data = read(full_path)
+ return torch.FloatTensor(data.astype(np.float32)), sampling_rate
+
+
+def load_filepaths_and_text(filename, split="|"):
+ with open(filename, encoding='utf-8') as f:
+ filepaths_and_text = [line.strip().split(split) for line in f]
+ return filepaths_and_text
+
+
+def get_hparams(init=True):
+ parser = argparse.ArgumentParser()
+ parser.add_argument('-c', '--config', type=str, default="./configs/base.json",
+ help='JSON file for configuration')
+ parser.add_argument('-m', '--model', type=str, required=True,
+ help='Model name')
+
+ args = parser.parse_args()
+ model_dir = os.path.join("./logs", args.model)
+
+ if not os.path.exists(model_dir):
+ os.makedirs(model_dir)
+
+ config_path = args.config
+ config_save_path = os.path.join(model_dir, "config.json")
+ if init:
+ with open(config_path, "r") as f:
+ data = f.read()
+ with open(config_save_path, "w") as f:
+ f.write(data)
+ else:
+ with open(config_save_path, "r") as f:
+ data = f.read()
+ config = json.loads(data)
+
+ hparams = HParams(**config)
+ hparams.model_dir = model_dir
+ return hparams
+
+
+def get_hparams_from_dir(model_dir):
+ config_save_path = os.path.join(model_dir, "config.json")
+ with open(config_save_path, "r") as f:
+ data = f.read()
+ config = json.loads(data)
+
+ hparams =HParams(**config)
+ hparams.model_dir = model_dir
+ return hparams
+
+
+def get_hparams_from_file(config_path):
+ with open(config_path, "r") as f:
+ data = f.read()
+ config = json.loads(data)
+
+ hparams =HParams(**config)
+ return hparams
+
+
+def check_git_hash(model_dir):
+ source_dir = os.path.dirname(os.path.realpath(__file__))
+ if not os.path.exists(os.path.join(source_dir, ".git")):
+ logger.warning("{} is not a git repository, therefore hash value comparison will be ignored.".format(
+ source_dir
+ ))
+ return
+
+ cur_hash = subprocess.getoutput("git rev-parse HEAD")
+
+ path = os.path.join(model_dir, "githash")
+ if os.path.exists(path):
+ saved_hash = open(path).read()
+ if saved_hash != cur_hash:
+ logger.warning("git hash values are different. {}(saved) != {}(current)".format(
+ saved_hash[:8], cur_hash[:8]))
+ else:
+ open(path, "w").write(cur_hash)
+
+
+def get_logger(model_dir, filename="train.log"):
+ global logger
+ logger = logging.getLogger(os.path.basename(model_dir))
+ logger.setLevel(logging.DEBUG)
+
+ formatter = logging.Formatter("%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s")
+ if not os.path.exists(model_dir):
+ os.makedirs(model_dir)
+ h = logging.FileHandler(os.path.join(model_dir, filename))
+ h.setLevel(logging.DEBUG)
+ h.setFormatter(formatter)
+ logger.addHandler(h)
+ return logger
+
+
+def repeat_expand_2d(content, target_len):
+ # content : [h, t]
+
+ src_len = content.shape[-1]
+ target = torch.zeros([content.shape[0], target_len], dtype=torch.float).to(content.device)
+ temp = torch.arange(src_len+1) * target_len / src_len
+ current_pos = 0
+ for i in range(target_len):
+ if i < temp[current_pos+1]:
+ target[:, i] = content[:, current_pos]
+ else:
+ current_pos += 1
+ target[:, i] = content[:, current_pos]
+
+ return target
+
+
+def mix_model(model_paths,mix_rate,mode):
+ mix_rate = torch.FloatTensor(mix_rate)/100
+ model_tem = torch.load(model_paths[0])
+ models = [torch.load(path)["model"] for path in model_paths]
+ if mode == 0:
+ mix_rate = F.softmax(mix_rate,dim=0)
+ for k in model_tem["model"].keys():
+ model_tem["model"][k] = torch.zeros_like(model_tem["model"][k])
+ for i,model in enumerate(models):
+ model_tem["model"][k] += model[k]*mix_rate[i]
+ torch.save(model_tem,os.path.join(os.path.curdir,"output.pth"))
+ return os.path.join(os.path.curdir,"output.pth")
+
+class HParams():
+ def __init__(self, **kwargs):
+ for k, v in kwargs.items():
+ if isinstance(v, dict):
+ v = HParams(**v)
+ self[k] = v
+
+ def keys(self):
+ return self.__dict__.keys()
+
+ def items(self):
+ return self.__dict__.items()
+
+ def values(self):
+ return self.__dict__.values()
+
+ def __len__(self):
+ return len(self.__dict__)
+
+ def __getitem__(self, key):
+ return getattr(self, key)
+
+ def __setitem__(self, key, value):
+ return setattr(self, key, value)
+
+ def __contains__(self, key):
+ return key in self.__dict__
+
+ def __repr__(self):
+ return self.__dict__.__repr__()
+
diff --git a/so-vits-svc/vdecoder/__init__.py b/so-vits-svc/vdecoder/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/so-vits-svc/vdecoder/hifigan/__pycache__/env.cpython-38.pyc b/so-vits-svc/vdecoder/hifigan/__pycache__/env.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..38b9f3e6c2d958b379a5189ba3beabeae0cef087
Binary files /dev/null and b/so-vits-svc/vdecoder/hifigan/__pycache__/env.cpython-38.pyc differ
diff --git a/so-vits-svc/vdecoder/hifigan/__pycache__/models.cpython-38.pyc b/so-vits-svc/vdecoder/hifigan/__pycache__/models.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..0730c487cd0de506d0c6b23b6b26460eab40f282
Binary files /dev/null and b/so-vits-svc/vdecoder/hifigan/__pycache__/models.cpython-38.pyc differ
diff --git a/so-vits-svc/vdecoder/hifigan/__pycache__/utils.cpython-38.pyc b/so-vits-svc/vdecoder/hifigan/__pycache__/utils.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..650ee5dae0e1f58d7e1a9752654c39e6790f8d58
Binary files /dev/null and b/so-vits-svc/vdecoder/hifigan/__pycache__/utils.cpython-38.pyc differ
diff --git a/so-vits-svc/vdecoder/hifigan/env.py b/so-vits-svc/vdecoder/hifigan/env.py
new file mode 100644
index 0000000000000000000000000000000000000000..2bdbc95d4f7a8bad8fd4f5eef657e2b51d946056
--- /dev/null
+++ b/so-vits-svc/vdecoder/hifigan/env.py
@@ -0,0 +1,15 @@
+import os
+import shutil
+
+
+class AttrDict(dict):
+ def __init__(self, *args, **kwargs):
+ super(AttrDict, self).__init__(*args, **kwargs)
+ self.__dict__ = self
+
+
+def build_env(config, config_name, path):
+ t_path = os.path.join(path, config_name)
+ if config != t_path:
+ os.makedirs(path, exist_ok=True)
+ shutil.copyfile(config, os.path.join(path, config_name))
diff --git a/so-vits-svc/vdecoder/hifigan/models.py b/so-vits-svc/vdecoder/hifigan/models.py
new file mode 100644
index 0000000000000000000000000000000000000000..9747301f350bb269e62601017fe4633ce271b27e
--- /dev/null
+++ b/so-vits-svc/vdecoder/hifigan/models.py
@@ -0,0 +1,503 @@
+import os
+import json
+from .env import AttrDict
+import numpy as np
+import torch
+import torch.nn.functional as F
+import torch.nn as nn
+from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
+from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
+from .utils import init_weights, get_padding
+
+LRELU_SLOPE = 0.1
+
+
+def load_model(model_path, device='cuda'):
+ config_file = os.path.join(os.path.split(model_path)[0], 'config.json')
+ with open(config_file) as f:
+ data = f.read()
+
+ global h
+ json_config = json.loads(data)
+ h = AttrDict(json_config)
+
+ generator = Generator(h).to(device)
+
+ cp_dict = torch.load(model_path, map_location=device)
+ generator.load_state_dict(cp_dict['generator'])
+ generator.eval()
+ generator.remove_weight_norm()
+ del cp_dict
+ return generator, h
+
+
+class ResBlock1(torch.nn.Module):
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
+ super(ResBlock1, self).__init__()
+ self.h = h
+ self.convs1 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=get_padding(kernel_size, dilation[1]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
+ padding=get_padding(kernel_size, dilation[2])))
+ ])
+ self.convs1.apply(init_weights)
+
+ self.convs2 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1)))
+ ])
+ self.convs2.apply(init_weights)
+
+ def forward(self, x):
+ for c1, c2 in zip(self.convs1, self.convs2):
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ xt = c1(xt)
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
+ xt = c2(xt)
+ x = xt + x
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs1:
+ remove_weight_norm(l)
+ for l in self.convs2:
+ remove_weight_norm(l)
+
+
+class ResBlock2(torch.nn.Module):
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
+ super(ResBlock2, self).__init__()
+ self.h = h
+ self.convs = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=get_padding(kernel_size, dilation[1])))
+ ])
+ self.convs.apply(init_weights)
+
+ def forward(self, x):
+ for c in self.convs:
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ xt = c(xt)
+ x = xt + x
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs:
+ remove_weight_norm(l)
+
+
+def padDiff(x):
+ return F.pad(F.pad(x, (0,0,-1,1), 'constant', 0) - x, (0,0,0,-1), 'constant', 0)
+
+class SineGen(torch.nn.Module):
+ """ Definition of sine generator
+ SineGen(samp_rate, harmonic_num = 0,
+ sine_amp = 0.1, noise_std = 0.003,
+ voiced_threshold = 0,
+ flag_for_pulse=False)
+ samp_rate: sampling rate in Hz
+ harmonic_num: number of harmonic overtones (default 0)
+ sine_amp: amplitude of sine waveform (default 0.1)
+ noise_std: std of Gaussian noise (default 0.003)
+ voiced_threshold: F0 threshold for U/V classification (default 0)
+ flag_for_pulse: this SineGen is used inside PulseGen (default False)
+ Note: when flag_for_pulse is True, the first time step of a voiced
+ segment is always sin(np.pi) or cos(0)
+ """
+
+ def __init__(self, samp_rate, harmonic_num=0,
+ sine_amp=0.1, noise_std=0.003,
+ voiced_threshold=0,
+ flag_for_pulse=False):
+ super(SineGen, self).__init__()
+ self.sine_amp = sine_amp
+ self.noise_std = noise_std
+ self.harmonic_num = harmonic_num
+ self.dim = self.harmonic_num + 1
+ self.sampling_rate = samp_rate
+ self.voiced_threshold = voiced_threshold
+ self.flag_for_pulse = flag_for_pulse
+
+ def _f02uv(self, f0):
+ # generate uv signal
+ uv = (f0 > self.voiced_threshold).type(torch.float32)
+ return uv
+
+ def _f02sine(self, f0_values):
+ """ f0_values: (batchsize, length, dim)
+ where dim indicates fundamental tone and overtones
+ """
+ # convert to F0 in rad. The integer part n can be ignored
+ # because 2 * np.pi * n doesn't affect phase
+ rad_values = (f0_values / self.sampling_rate) % 1
+
+ # initial phase noise (no noise for fundamental component)
+ rand_ini = torch.rand(f0_values.shape[0], f0_values.shape[2], \
+ device=f0_values.device)
+ rand_ini[:, 0] = 0
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
+
+ # instantaneous phase sine[t] = sin(2*pi \sum_i=1 ^{t} rad)
+ if not self.flag_for_pulse:
+ # for normal case
+
+ # To prevent torch.cumsum numerical overflow,
+ # it is necessary to add -1 whenever \sum_k=1^n rad_value_k > 1.
+ # Buffer tmp_over_one_idx indicates the time step to add -1.
+ # This will not change F0 of sine because (x-1) * 2*pi = x * 2*pi
+ tmp_over_one = torch.cumsum(rad_values, 1) % 1
+ tmp_over_one_idx = (padDiff(tmp_over_one)) < 0
+ cumsum_shift = torch.zeros_like(rad_values)
+ cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
+
+ sines = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1)
+ * 2 * np.pi)
+ else:
+ # If necessary, make sure that the first time step of every
+ # voiced segment is sin(pi) or cos(0)
+ # This is used for pulse-train generation
+
+ # identify the last time step in unvoiced segments
+ uv = self._f02uv(f0_values)
+ uv_1 = torch.roll(uv, shifts=-1, dims=1)
+ uv_1[:, -1, :] = 1
+ u_loc = (uv < 1) * (uv_1 > 0)
+
+ # get the instantaneous phase
+ tmp_cumsum = torch.cumsum(rad_values, dim=1)
+ # different batch needs to be processed differently
+ for idx in range(f0_values.shape[0]):
+ temp_sum = tmp_cumsum[idx, u_loc[idx, :, 0], :]
+ temp_sum[1:, :] = temp_sum[1:, :] - temp_sum[0:-1, :]
+ # stores the accumulation of i.phase within
+ # each voiced segment
+ tmp_cumsum[idx, :, :] = 0
+ tmp_cumsum[idx, u_loc[idx, :, 0], :] = temp_sum
+
+ # rad_values - tmp_cumsum: remove the accumulation of i.phase
+ # within the previous voiced segment.
+ i_phase = torch.cumsum(rad_values - tmp_cumsum, dim=1)
+
+ # get the sines
+ sines = torch.cos(i_phase * 2 * np.pi)
+ return sines
+
+ def forward(self, f0):
+ """ sine_tensor, uv = forward(f0)
+ input F0: tensor(batchsize=1, length, dim=1)
+ f0 for unvoiced steps should be 0
+ output sine_tensor: tensor(batchsize=1, length, dim)
+ output uv: tensor(batchsize=1, length, 1)
+ """
+ with torch.no_grad():
+ f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim,
+ device=f0.device)
+ # fundamental component
+ fn = torch.multiply(f0, torch.FloatTensor([[range(1, self.harmonic_num + 2)]]).to(f0.device))
+
+ # generate sine waveforms
+ sine_waves = self._f02sine(fn) * self.sine_amp
+
+ # generate uv signal
+ # uv = torch.ones(f0.shape)
+ # uv = uv * (f0 > self.voiced_threshold)
+ uv = self._f02uv(f0)
+
+ # noise: for unvoiced should be similar to sine_amp
+ # std = self.sine_amp/3 -> max value ~ self.sine_amp
+ # . for voiced regions is self.noise_std
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
+ noise = noise_amp * torch.randn_like(sine_waves)
+
+ # first: set the unvoiced part to 0 by uv
+ # then: additive noise
+ sine_waves = sine_waves * uv + noise
+ return sine_waves, uv, noise
+
+
+class SourceModuleHnNSF(torch.nn.Module):
+ """ SourceModule for hn-nsf
+ SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
+ add_noise_std=0.003, voiced_threshod=0)
+ sampling_rate: sampling_rate in Hz
+ harmonic_num: number of harmonic above F0 (default: 0)
+ sine_amp: amplitude of sine source signal (default: 0.1)
+ add_noise_std: std of additive Gaussian noise (default: 0.003)
+ note that amplitude of noise in unvoiced is decided
+ by sine_amp
+ voiced_threshold: threshold to set U/V given F0 (default: 0)
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
+ F0_sampled (batchsize, length, 1)
+ Sine_source (batchsize, length, 1)
+ noise_source (batchsize, length, 1)
+ uv (batchsize, length, 1)
+ """
+
+ def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
+ add_noise_std=0.003, voiced_threshod=0):
+ super(SourceModuleHnNSF, self).__init__()
+
+ self.sine_amp = sine_amp
+ self.noise_std = add_noise_std
+
+ # to produce sine waveforms
+ self.l_sin_gen = SineGen(sampling_rate, harmonic_num,
+ sine_amp, add_noise_std, voiced_threshod)
+
+ # to merge source harmonics into a single excitation
+ self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
+ self.l_tanh = torch.nn.Tanh()
+
+ def forward(self, x):
+ """
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
+ F0_sampled (batchsize, length, 1)
+ Sine_source (batchsize, length, 1)
+ noise_source (batchsize, length, 1)
+ """
+ # source for harmonic branch
+ sine_wavs, uv, _ = self.l_sin_gen(x)
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
+
+ # source for noise branch, in the same shape as uv
+ noise = torch.randn_like(uv) * self.sine_amp / 3
+ return sine_merge, noise, uv
+
+
+class Generator(torch.nn.Module):
+ def __init__(self, h):
+ super(Generator, self).__init__()
+ self.h = h
+
+ self.num_kernels = len(h["resblock_kernel_sizes"])
+ self.num_upsamples = len(h["upsample_rates"])
+ self.f0_upsamp = torch.nn.Upsample(scale_factor=np.prod(h["upsample_rates"]))
+ self.m_source = SourceModuleHnNSF(
+ sampling_rate=h["sampling_rate"],
+ harmonic_num=8)
+ self.noise_convs = nn.ModuleList()
+ self.conv_pre = weight_norm(Conv1d(h["inter_channels"], h["upsample_initial_channel"], 7, 1, padding=3))
+ resblock = ResBlock1 if h["resblock"] == '1' else ResBlock2
+ self.ups = nn.ModuleList()
+ for i, (u, k) in enumerate(zip(h["upsample_rates"], h["upsample_kernel_sizes"])):
+ c_cur = h["upsample_initial_channel"] // (2 ** (i + 1))
+ self.ups.append(weight_norm(
+ ConvTranspose1d(h["upsample_initial_channel"] // (2 ** i), h["upsample_initial_channel"] // (2 ** (i + 1)),
+ k, u, padding=(k - u) // 2)))
+ if i + 1 < len(h["upsample_rates"]): #
+ stride_f0 = np.prod(h["upsample_rates"][i + 1:])
+ self.noise_convs.append(Conv1d(
+ 1, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=stride_f0 // 2))
+ else:
+ self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
+ self.resblocks = nn.ModuleList()
+ for i in range(len(self.ups)):
+ ch = h["upsample_initial_channel"] // (2 ** (i + 1))
+ for j, (k, d) in enumerate(zip(h["resblock_kernel_sizes"], h["resblock_dilation_sizes"])):
+ self.resblocks.append(resblock(h, ch, k, d))
+
+ self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
+ self.ups.apply(init_weights)
+ self.conv_post.apply(init_weights)
+ self.cond = nn.Conv1d(h['gin_channels'], h['upsample_initial_channel'], 1)
+
+ def forward(self, x, f0, g=None):
+ # print(1,x.shape,f0.shape,f0[:, None].shape)
+ f0 = self.f0_upsamp(f0[:, None]).transpose(1, 2) # bs,n,t
+ # print(2,f0.shape)
+ har_source, noi_source, uv = self.m_source(f0)
+ har_source = har_source.transpose(1, 2)
+ x = self.conv_pre(x)
+ x = x + self.cond(g)
+ # print(124,x.shape,har_source.shape)
+ for i in range(self.num_upsamples):
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ # print(3,x.shape)
+ x = self.ups[i](x)
+ x_source = self.noise_convs[i](har_source)
+ # print(4,x_source.shape,har_source.shape,x.shape)
+ x = x + x_source
+ xs = None
+ for j in range(self.num_kernels):
+ if xs is None:
+ xs = self.resblocks[i * self.num_kernels + j](x)
+ else:
+ xs += self.resblocks[i * self.num_kernels + j](x)
+ x = xs / self.num_kernels
+ x = F.leaky_relu(x)
+ x = self.conv_post(x)
+ x = torch.tanh(x)
+
+ return x
+
+ def remove_weight_norm(self):
+ print('Removing weight norm...')
+ for l in self.ups:
+ remove_weight_norm(l)
+ for l in self.resblocks:
+ l.remove_weight_norm()
+ remove_weight_norm(self.conv_pre)
+ remove_weight_norm(self.conv_post)
+
+
+class DiscriminatorP(torch.nn.Module):
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
+ super(DiscriminatorP, self).__init__()
+ self.period = period
+        norm_f = spectral_norm if use_spectral_norm else weight_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
+ ])
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
+
+ def forward(self, x):
+ fmap = []
+
+ # 1d to 2d
+ b, c, t = x.shape
+ if t % self.period != 0: # pad first
+ n_pad = self.period - (t % self.period)
+ x = F.pad(x, (0, n_pad), "reflect")
+ t = t + n_pad
+ x = x.view(b, c, t // self.period, self.period)
+
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class MultiPeriodDiscriminator(torch.nn.Module):
+ def __init__(self, periods=None):
+ super(MultiPeriodDiscriminator, self).__init__()
+ self.periods = periods if periods is not None else [2, 3, 5, 7, 11]
+ self.discriminators = nn.ModuleList()
+ for period in self.periods:
+ self.discriminators.append(DiscriminatorP(period))
+
+ def forward(self, y, y_hat):
+ y_d_rs = []
+ y_d_gs = []
+ fmap_rs = []
+ fmap_gs = []
+ for i, d in enumerate(self.discriminators):
+ y_d_r, fmap_r = d(y)
+ y_d_g, fmap_g = d(y_hat)
+ y_d_rs.append(y_d_r)
+ fmap_rs.append(fmap_r)
+ y_d_gs.append(y_d_g)
+ fmap_gs.append(fmap_g)
+
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+class DiscriminatorS(torch.nn.Module):
+ def __init__(self, use_spectral_norm=False):
+ super(DiscriminatorS, self).__init__()
+        norm_f = spectral_norm if use_spectral_norm else weight_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv1d(1, 128, 15, 1, padding=7)),
+ norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
+ norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
+ norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
+ ])
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
+
+ def forward(self, x):
+ fmap = []
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class MultiScaleDiscriminator(torch.nn.Module):
+ def __init__(self):
+ super(MultiScaleDiscriminator, self).__init__()
+ self.discriminators = nn.ModuleList([
+ DiscriminatorS(use_spectral_norm=True),
+ DiscriminatorS(),
+ DiscriminatorS(),
+ ])
+ self.meanpools = nn.ModuleList([
+ AvgPool1d(4, 2, padding=2),
+ AvgPool1d(4, 2, padding=2)
+ ])
+
+ def forward(self, y, y_hat):
+ y_d_rs = []
+ y_d_gs = []
+ fmap_rs = []
+ fmap_gs = []
+ for i, d in enumerate(self.discriminators):
+ if i != 0:
+ y = self.meanpools[i - 1](y)
+ y_hat = self.meanpools[i - 1](y_hat)
+ y_d_r, fmap_r = d(y)
+ y_d_g, fmap_g = d(y_hat)
+ y_d_rs.append(y_d_r)
+ fmap_rs.append(fmap_r)
+ y_d_gs.append(y_d_g)
+ fmap_gs.append(fmap_g)
+
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+def feature_loss(fmap_r, fmap_g):
+ loss = 0
+ for dr, dg in zip(fmap_r, fmap_g):
+ for rl, gl in zip(dr, dg):
+ loss += torch.mean(torch.abs(rl - gl))
+
+ return loss * 2
+
+
+def discriminator_loss(disc_real_outputs, disc_generated_outputs):
+ loss = 0
+ r_losses = []
+ g_losses = []
+ for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
+ r_loss = torch.mean((1 - dr) ** 2)
+ g_loss = torch.mean(dg ** 2)
+ loss += (r_loss + g_loss)
+ r_losses.append(r_loss.item())
+ g_losses.append(g_loss.item())
+
+ return loss, r_losses, g_losses
+
+
+def generator_loss(disc_outputs):
+ loss = 0
+ gen_losses = []
+ for dg in disc_outputs:
+ l = torch.mean((1 - dg) ** 2)
+ gen_losses.append(l)
+ loss += l
+
+ return loss, gen_losses
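Note on the three loss helpers above: they follow the least-squares GAN pattern. The discriminator pushes its outputs on real audio toward 1 and on generated audio toward 0, while the generator pushes the discriminator's outputs on generated audio toward 1. Below is a minimal sketch of that arithmetic in plain Python, with lists of floats standing in for tensors. The names `mean`, `ls_disc_loss` and `ls_gen_loss` are hypothetical and not part of this PR.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def ls_disc_loss(real_outputs, generated_outputs):
    """Sum over sub-discriminators of mean((1 - D(y))^2) + mean(D(y_hat)^2)."""
    total = 0.0
    for real, fake in zip(real_outputs, generated_outputs):
        r_loss = mean([(1.0 - r) ** 2 for r in real])
        g_loss = mean([f ** 2 for f in fake])
        total += r_loss + g_loss
    return total


def ls_gen_loss(generated_outputs):
    """Sum over sub-discriminators of mean((1 - D(y_hat))^2)."""
    total = 0.0
    for fake in generated_outputs:
        total += mean([(1.0 - f) ** 2 for f in fake])
    return total


# One sub-discriminator with two output values each.
perfect_d = ls_disc_loss([[1.0, 1.0]], [[0.0, 0.0]])   # discriminator is always right
undecided = ls_disc_loss([[0.5, 0.5]], [[0.5, 0.5]])   # discriminator cannot tell
fooled_d = ls_gen_loss([[1.0, 1.0]])                   # generator fully fools it
```

Both losses reach zero only at their respective targets, which is why the generator and discriminator objectives pull the same outputs in opposite directions.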
diff --git a/so-vits-svc/vdecoder/hifigan/nvSTFT.py b/so-vits-svc/vdecoder/hifigan/nvSTFT.py
new file mode 100644
index 0000000000000000000000000000000000000000..88597d62a505715091f9ba62d38bf0a85a31b95a
--- /dev/null
+++ b/so-vits-svc/vdecoder/hifigan/nvSTFT.py
@@ -0,0 +1,111 @@
+import math
+import os
+os.environ["LRU_CACHE_CAPACITY"] = "3"
+import random
+import torch
+import torch.utils.data
+import numpy as np
+import librosa
+from librosa.util import normalize
+from librosa.filters import mel as librosa_mel_fn
+from scipy.io.wavfile import read
+import soundfile as sf
+
+def load_wav_to_torch(full_path, target_sr=None, return_empty_on_exception=False):
+ sampling_rate = None
+ try:
+        data, sampling_rate = sf.read(full_path, always_2d=True)  # always_2d=True returns shape (frames, channels), even for mono
+ except Exception as ex:
+ print(f"'{full_path}' failed to load.\nException:")
+ print(ex)
+ if return_empty_on_exception:
+ return [], sampling_rate or target_sr or 32000
+ else:
+            raise
+
+ if len(data.shape) > 1:
+ data = data[:, 0]
+ assert len(data) > 2# check duration of audio file is > 2 samples (because otherwise the slice operation was on the wrong dimension)
+
+ if np.issubdtype(data.dtype, np.integer): # if audio data is type int
+ max_mag = -np.iinfo(data.dtype).min # maximum magnitude = min possible value of intXX
+    else: # if audio data is floating point (sf.read returns float64 by default)
+ max_mag = max(np.amax(data), -np.amin(data))
+ max_mag = (2**31)+1 if max_mag > (2**15) else ((2**15)+1 if max_mag > 1.01 else 1.0) # data should be either 16-bit INT, 32-bit INT or [-1 to 1] float32
+
+ data = torch.FloatTensor(data.astype(np.float32))/max_mag
+
+ if (torch.isinf(data) | torch.isnan(data)).any() and return_empty_on_exception:# resample will crash with inf/NaN inputs. return_empty_on_exception will return empty arr instead of except
+ return [], sampling_rate or target_sr or 32000
+ if target_sr is not None and sampling_rate != target_sr:
+ data = torch.from_numpy(librosa.core.resample(data.numpy(), orig_sr=sampling_rate, target_sr=target_sr))
+ sampling_rate = target_sr
+
+ return data, sampling_rate
+
+def dynamic_range_compression(x, C=1, clip_val=1e-5):
+ return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
+
+def dynamic_range_decompression(x, C=1):
+ return np.exp(x) / C
+
+def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
+ return torch.log(torch.clamp(x, min=clip_val) * C)
+
+def dynamic_range_decompression_torch(x, C=1):
+ return torch.exp(x) / C
+
+class STFT():
+ def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop_length=256, fmin=20, fmax=11025, clip_val=1e-5):
+ self.target_sr = sr
+
+ self.n_mels = n_mels
+ self.n_fft = n_fft
+ self.win_size = win_size
+ self.hop_length = hop_length
+ self.fmin = fmin
+ self.fmax = fmax
+ self.clip_val = clip_val
+ self.mel_basis = {}
+ self.hann_window = {}
+
+    def get_mel(self, y, center=False):
+        sampling_rate = self.target_sr
+        n_mels = self.n_mels
+        n_fft = self.n_fft
+        win_size = self.win_size
+        hop_length = self.hop_length
+        fmin = self.fmin
+        fmax = self.fmax
+        clip_val = self.clip_val
+
+        if torch.min(y) < -1.:
+            print('min value is ', torch.min(y))
+        if torch.max(y) > 1.:
+            print('max value is ', torch.max(y))
+
+        if str(fmax)+'_'+str(y.device) not in self.mel_basis:  # look up with the same key used to store below
+            mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
+            self.mel_basis[str(fmax)+'_'+str(y.device)] = torch.from_numpy(mel).float().to(y.device)
+            self.hann_window[str(y.device)] = torch.hann_window(self.win_size).to(y.device)
+
+        y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_length)/2), int((n_fft-hop_length)/2)), mode='reflect')
+        y = y.squeeze(1)
+
+        spec = torch.stft(y, n_fft, hop_length=hop_length, win_length=win_size, window=self.hann_window[str(y.device)],
+                          center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
+        # return_complex=False (as in nsf_hifigan/nvSTFT.py) yields a trailing (real, imag) axis; reduce it to magnitudes
+        spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))
+        # project onto the mel filterbank: (n_mels, n_fft//2+1) @ (batch, n_fft//2+1, frames)
+        spec = torch.matmul(self.mel_basis[str(fmax)+'_'+str(y.device)], spec)
+        # log-compress the mel magnitudes, clamping at clip_val to avoid log(0)
+        spec = dynamic_range_compression_torch(spec, clip_val=clip_val)
+        # spec: (batch, n_mels, frames)
+        return spec
+
+ def __call__(self, audiopath):
+ audio, sr = load_wav_to_torch(audiopath, target_sr=self.target_sr)
+ spect = self.get_mel(audio.unsqueeze(0)).squeeze(0)
+ return spect
+
+stft = STFT()
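Note on the filterbank cache in `get_mel`: entries are stored under a string key that combines `fmax` and the device. A cache guard only helps if it tests the same key that the store uses. Testing a different value, for instance the bare `fmax` integer, never hits and silently rebuilds the filterbank on every call. The sketch below shows the difference with a plain dict and a build counter in place of `librosa_mel_fn`. The `MelCache` class and its method names are hypothetical.

```python
class MelCache:
    """Toy stand-in for the mel_basis dict; counts how often a filterbank is built."""

    def __init__(self):
        self.store = {}
        self.builds = 0

    def _build(self, fmax, device):
        # Placeholder for the expensive filterbank construction.
        self.builds += 1
        return ("filterbank", fmax, device)

    def get_mismatched(self, fmax, device):
        # Guard tests the integer fmax, but entries are stored under a string key.
        if fmax not in self.store:
            self.store[str(fmax) + "_" + device] = self._build(fmax, device)
        return self.store[str(fmax) + "_" + device]

    def get_matched(self, fmax, device):
        # Guard and store use the same key, so the second call is a cache hit.
        key = str(fmax) + "_" + device
        if key not in self.store:
            self.store[key] = self._build(fmax, device)
        return self.store[key]


mismatched = MelCache()
for _ in range(3):
    mismatched.get_mismatched(11025, "cpu")

matched = MelCache()
for _ in range(3):
    matched.get_matched(11025, "cpu")
```

The results are identical in both variants; only the amount of repeated work differs, which is why this kind of slip tends to go unnoticed.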
diff --git a/so-vits-svc/vdecoder/hifigan/utils.py b/so-vits-svc/vdecoder/hifigan/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c93c996d3cc73c30d71c1fc47056e4230f35c0f
--- /dev/null
+++ b/so-vits-svc/vdecoder/hifigan/utils.py
@@ -0,0 +1,68 @@
+import glob
+import os
+import matplotlib
+import torch
+from torch.nn.utils import weight_norm
+# matplotlib.use("Agg")
+import matplotlib.pylab as plt
+
+
+def plot_spectrogram(spectrogram):
+ fig, ax = plt.subplots(figsize=(10, 2))
+ im = ax.imshow(spectrogram, aspect="auto", origin="lower",
+ interpolation='none')
+ plt.colorbar(im, ax=ax)
+
+ fig.canvas.draw()
+ plt.close()
+
+ return fig
+
+
+def init_weights(m, mean=0.0, std=0.01):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ m.weight.data.normal_(mean, std)
+
+
+def apply_weight_norm(m):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ weight_norm(m)
+
+
+def get_padding(kernel_size, dilation=1):
+ return int((kernel_size*dilation - dilation)/2)
+
+
+def load_checkpoint(filepath, device):
+ assert os.path.isfile(filepath)
+ print("Loading '{}'".format(filepath))
+ checkpoint_dict = torch.load(filepath, map_location=device)
+ print("Complete.")
+ return checkpoint_dict
+
+
+def save_checkpoint(filepath, obj):
+ print("Saving checkpoint to {}".format(filepath))
+ torch.save(obj, filepath)
+ print("Complete.")
+
+
+def del_old_checkpoints(cp_dir, prefix, n_models=2):
+    pattern = os.path.join(cp_dir, prefix + '????????')
+    cp_list = glob.glob(pattern)  # get checkpoint paths
+    cp_list = sorted(cp_list)  # sort by iter
+    if len(cp_list) > n_models:  # if more than n_models models are found
+        for cp in cp_list[:-n_models]:  # delete the oldest models, keeping the latest n_models
+            open(cp, 'w').close()  # empty file contents
+            os.unlink(cp)  # delete file (move to trash when using Colab)
+
+
+def scan_checkpoint(cp_dir, prefix):
+ pattern = os.path.join(cp_dir, prefix + '????????')
+ cp_list = glob.glob(pattern)
+ if len(cp_list) == 0:
+ return None
+ return sorted(cp_list)[-1]
+
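Note on `get_padding` in this file: it returns the padding that keeps a stride-1 dilated convolution length-preserving when the kernel size is odd. A quick check in plain Python, using the standard 1-D convolution output-length formula. The helper `conv1d_out_len` is an illustrative name, and the kernel sizes and dilations tried here are assumed values, not read from any config in this PR.

```python
def get_padding(kernel_size, dilation=1):
    # Same arithmetic as the helper in utils.py.
    return int((kernel_size * dilation - dilation) / 2)


def conv1d_out_len(length, kernel_size, dilation=1, stride=1, padding=0):
    """Output length of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Try a few odd kernel sizes and dilations on a length-100 input.
lengths = []
for k in (3, 7, 11):
    for d in (1, 3, 5):
        p = get_padding(k, d)
        lengths.append(conv1d_out_len(100, k, dilation=d, padding=p))
```

Every entry in `lengths` equals the input length, which is what lets the residual blocks add a convolution's output back onto its input.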
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/env.cpython-38.pyc b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/env.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..853314dedbc34ae01c455a12bd341c9805abbb64
Binary files /dev/null and b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/env.cpython-38.pyc differ
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/models.cpython-38.pyc b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/models.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..2d8fcb68a6bbc90f76355bd5edffb738e98e2895
Binary files /dev/null and b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/models.cpython-38.pyc differ
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/nvSTFT.cpython-38.pyc b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/nvSTFT.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..bbf57b152f3fb3dac1d67de1f2a386a6d850450e
Binary files /dev/null and b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/nvSTFT.cpython-38.pyc differ
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/utils.cpython-38.pyc b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/utils.cpython-38.pyc
new file mode 100644
index 0000000000000000000000000000000000000000..ca3d1723f196b0afd33407ee9e50f242f79f276f
Binary files /dev/null and b/so-vits-svc/vdecoder/nsf_hifigan/__pycache__/utils.cpython-38.pyc differ
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/env.py b/so-vits-svc/vdecoder/nsf_hifigan/env.py
new file mode 100644
index 0000000000000000000000000000000000000000..2bdbc95d4f7a8bad8fd4f5eef657e2b51d946056
--- /dev/null
+++ b/so-vits-svc/vdecoder/nsf_hifigan/env.py
@@ -0,0 +1,15 @@
+import os
+import shutil
+
+
+class AttrDict(dict):
+ def __init__(self, *args, **kwargs):
+ super(AttrDict, self).__init__(*args, **kwargs)
+ self.__dict__ = self
+
+
+def build_env(config, config_name, path):
+ t_path = os.path.join(path, config_name)
+ if config != t_path:
+ os.makedirs(path, exist_ok=True)
+ shutil.copyfile(config, os.path.join(path, config_name))
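Note on `AttrDict` above: assigning `self.__dict__ = self` makes the instance's attribute namespace and its dict storage the same object, so `h.key` and `h["key"]` read and write the same entry. That is why `nsf_hifigan/models.py` can use `h.upsample_rates` on a config loaded from JSON. A small sketch follows; the config values are made up for illustration.

```python
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Attribute lookups now resolve against the dict's own items.
        self.__dict__ = self


# Hypothetical config values, for illustration only.
h = AttrDict({"sampling_rate": 44100, "upsample_rates": [8, 8, 2, 2]})

# Writing through attribute access also creates a dict key.
h.num_mels = 128
```

One consequence worth keeping in mind: a key that collides with a dict method name such as `items` is still reachable as `h["items"]`, but `h.items` returns the method.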
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/models.py b/so-vits-svc/vdecoder/nsf_hifigan/models.py
new file mode 100644
index 0000000000000000000000000000000000000000..eff691f31ac6bbea686c98982c31ce7b30efee75
--- /dev/null
+++ b/so-vits-svc/vdecoder/nsf_hifigan/models.py
@@ -0,0 +1,435 @@
+import os
+import json
+from .env import AttrDict
+import numpy as np
+import torch
+import torch.nn.functional as F
+import torch.nn as nn
+from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
+from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
+from .utils import init_weights, get_padding
+
+LRELU_SLOPE = 0.1
+
+
+def load_model(model_path, device='cuda'):
+ config_file = os.path.join(os.path.split(model_path)[0], 'config.json')
+ with open(config_file) as f:
+ data = f.read()
+
+ json_config = json.loads(data)
+ h = AttrDict(json_config)
+
+ generator = Generator(h).to(device)
+
+ cp_dict = torch.load(model_path, map_location=device)
+ generator.load_state_dict(cp_dict['generator'])
+ generator.eval()
+ generator.remove_weight_norm()
+ del cp_dict
+ return generator, h
+
+
+class ResBlock1(torch.nn.Module):
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
+ super(ResBlock1, self).__init__()
+ self.h = h
+ self.convs1 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=get_padding(kernel_size, dilation[1]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
+ padding=get_padding(kernel_size, dilation[2])))
+ ])
+ self.convs1.apply(init_weights)
+
+ self.convs2 = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+ padding=get_padding(kernel_size, 1)))
+ ])
+ self.convs2.apply(init_weights)
+
+ def forward(self, x):
+ for c1, c2 in zip(self.convs1, self.convs2):
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ xt = c1(xt)
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
+ xt = c2(xt)
+ x = xt + x
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs1:
+ remove_weight_norm(l)
+ for l in self.convs2:
+ remove_weight_norm(l)
+
+
+class ResBlock2(torch.nn.Module):
+ def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
+ super(ResBlock2, self).__init__()
+ self.h = h
+ self.convs = nn.ModuleList([
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+ padding=get_padding(kernel_size, dilation[0]))),
+ weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+ padding=get_padding(kernel_size, dilation[1])))
+ ])
+ self.convs.apply(init_weights)
+
+ def forward(self, x):
+ for c in self.convs:
+ xt = F.leaky_relu(x, LRELU_SLOPE)
+ xt = c(xt)
+ x = xt + x
+ return x
+
+ def remove_weight_norm(self):
+ for l in self.convs:
+ remove_weight_norm(l)
+
+
+class SineGen(torch.nn.Module):
+ """ Definition of sine generator
+    SineGen(samp_rate, harmonic_num = 0,
+            sine_amp = 0.1, noise_std = 0.003,
+            voiced_threshold = 0)
+    samp_rate: sampling rate in Hz
+    harmonic_num: number of harmonic overtones (default 0)
+    sine_amp: amplitude of sine-waveform (default 0.1)
+    noise_std: std of Gaussian noise (default 0.003)
+    voiced_threshold: F0 threshold for U/V classification (default 0)
+    Note: this variant's __init__ takes no flag_for_pulse argument;
+          forward() additionally takes upp, the frame-to-sample
+          upsampling factor (the Generator passes the product of
+          its upsample_rates)
+ """
+
+ def __init__(self, samp_rate, harmonic_num=0,
+ sine_amp=0.1, noise_std=0.003,
+ voiced_threshold=0):
+ super(SineGen, self).__init__()
+ self.sine_amp = sine_amp
+ self.noise_std = noise_std
+ self.harmonic_num = harmonic_num
+ self.dim = self.harmonic_num + 1
+ self.sampling_rate = samp_rate
+ self.voiced_threshold = voiced_threshold
+
+ def _f02uv(self, f0):
+ # generate uv signal
+ uv = torch.ones_like(f0)
+ uv = uv * (f0 > self.voiced_threshold)
+ return uv
+
+ @torch.no_grad()
+ def forward(self, f0, upp):
+        """ sine_waves, uv, noise = forward(f0, upp)
+        input f0: tensor(batchsize, length), one value per frame;
+                  f0 for unvoiced steps should be 0
+        output sine_waves, noise: tensor(batchsize, length * upp, dim)
+        output uv: tensor(batchsize, length * upp, 1)
+ """
+ f0 = f0.unsqueeze(-1)
+ fn = torch.multiply(f0, torch.arange(1, self.dim + 1, device=f0.device).reshape((1, 1, -1)))
+        rad_values = (fn / self.sampling_rate) % 1  # the % 1 here means the multiplication by harmonic number cannot be deferred to post-processing
+ rand_ini = torch.rand(fn.shape[0], fn.shape[2], device=fn.device)
+ rand_ini[:, 0] = 0
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
+ is_half = rad_values.dtype is not torch.float32
+        tmp_over_one = torch.cumsum(rad_values.double(), 1)  # % 1 is left out here: applying it would mean the later cumsum can no longer be optimized
+ if is_half:
+ tmp_over_one = tmp_over_one.half()
+ else:
+ tmp_over_one = tmp_over_one.float()
+ tmp_over_one *= upp
+ tmp_over_one = F.interpolate(
+ tmp_over_one.transpose(2, 1), scale_factor=upp,
+ mode='linear', align_corners=True
+ ).transpose(2, 1)
+ rad_values = F.interpolate(rad_values.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
+ tmp_over_one %= 1
+ tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
+ cumsum_shift = torch.zeros_like(rad_values)
+ cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
+ rad_values = rad_values.double()
+ cumsum_shift = cumsum_shift.double()
+ sine_waves = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi)
+ if is_half:
+ sine_waves = sine_waves.half()
+ else:
+ sine_waves = sine_waves.float()
+ sine_waves = sine_waves * self.sine_amp
+ uv = self._f02uv(f0)
+ uv = F.interpolate(uv.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
+ noise = noise_amp * torch.randn_like(sine_waves)
+ sine_waves = sine_waves * uv + noise
+ return sine_waves, uv, noise
+
+
+class SourceModuleHnNSF(torch.nn.Module):
+ """ SourceModule for hn-nsf
+    SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
+                 add_noise_std=0.003, voiced_threshod=0)
+    sampling_rate: sampling_rate in Hz
+    harmonic_num: number of harmonics above F0 (default: 0)
+    sine_amp: amplitude of sine source signal (default: 0.1)
+    add_noise_std: std of additive Gaussian noise (default: 0.003)
+        note that amplitude of noise in unvoiced is decided
+        by sine_amp
+    voiced_threshod: threshold to set U/V given F0 (default: 0)
+    sine_merge = SourceModuleHnNSF(...)(f0, upp)
+    f0 (batchsize, length)
+    upp: frame-to-sample upsampling factor (int)
+    sine_merge (batchsize, length * upp, 1)
+    Only the merged sine source is returned; noise and uv from SineGen are discarded
+ """
+
+ def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
+ add_noise_std=0.003, voiced_threshod=0):
+ super(SourceModuleHnNSF, self).__init__()
+
+ self.sine_amp = sine_amp
+ self.noise_std = add_noise_std
+
+ # to produce sine waveforms
+ self.l_sin_gen = SineGen(sampling_rate, harmonic_num,
+ sine_amp, add_noise_std, voiced_threshod)
+
+ # to merge source harmonics into a single excitation
+ self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
+ self.l_tanh = torch.nn.Tanh()
+
+ def forward(self, x, upp):
+ sine_wavs, uv, _ = self.l_sin_gen(x, upp)
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
+ return sine_merge
+
+
+class Generator(torch.nn.Module):
+ def __init__(self, h):
+ super(Generator, self).__init__()
+ self.h = h
+ self.num_kernels = len(h.resblock_kernel_sizes)
+ self.num_upsamples = len(h.upsample_rates)
+ self.m_source = SourceModuleHnNSF(
+ sampling_rate=h.sampling_rate,
+ harmonic_num=8
+ )
+ self.noise_convs = nn.ModuleList()
+ self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3))
+ resblock = ResBlock1 if h.resblock == '1' else ResBlock2
+
+ self.ups = nn.ModuleList()
+ for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
+ c_cur = h.upsample_initial_channel // (2 ** (i + 1))
+ self.ups.append(weight_norm(
+ ConvTranspose1d(h.upsample_initial_channel // (2 ** i), h.upsample_initial_channel // (2 ** (i + 1)),
+ k, u, padding=(k - u) // 2)))
+ if i + 1 < len(h.upsample_rates): #
+ stride_f0 = int(np.prod(h.upsample_rates[i + 1:]))
+ self.noise_convs.append(Conv1d(
+ 1, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=stride_f0 // 2))
+ else:
+ self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
+ self.resblocks = nn.ModuleList()
+ ch = h.upsample_initial_channel
+ for i in range(len(self.ups)):
+ ch //= 2
+ for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
+ self.resblocks.append(resblock(h, ch, k, d))
+
+ self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
+ self.ups.apply(init_weights)
+ self.conv_post.apply(init_weights)
+ self.upp = int(np.prod(h.upsample_rates))
+
+ def forward(self, x, f0):
+ har_source = self.m_source(f0, self.upp).transpose(1, 2)
+ x = self.conv_pre(x)
+ for i in range(self.num_upsamples):
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ x = self.ups[i](x)
+ x_source = self.noise_convs[i](har_source)
+ x = x + x_source
+ xs = None
+ for j in range(self.num_kernels):
+ if xs is None:
+ xs = self.resblocks[i * self.num_kernels + j](x)
+ else:
+ xs += self.resblocks[i * self.num_kernels + j](x)
+ x = xs / self.num_kernels
+ x = F.leaky_relu(x)
+ x = self.conv_post(x)
+ x = torch.tanh(x)
+
+ return x
+
+ def remove_weight_norm(self):
+ print('Removing weight norm...')
+ for l in self.ups:
+ remove_weight_norm(l)
+ for l in self.resblocks:
+ l.remove_weight_norm()
+ remove_weight_norm(self.conv_pre)
+ remove_weight_norm(self.conv_post)
+
+
+class DiscriminatorP(torch.nn.Module):
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
+ super(DiscriminatorP, self).__init__()
+ self.period = period
+        norm_f = spectral_norm if use_spectral_norm else weight_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+ norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
+ ])
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
+
+ def forward(self, x):
+ fmap = []
+
+ # 1d to 2d
+ b, c, t = x.shape
+ if t % self.period != 0: # pad first
+ n_pad = self.period - (t % self.period)
+ x = F.pad(x, (0, n_pad), "reflect")
+ t = t + n_pad
+ x = x.view(b, c, t // self.period, self.period)
+
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class MultiPeriodDiscriminator(torch.nn.Module):
+ def __init__(self, periods=None):
+ super(MultiPeriodDiscriminator, self).__init__()
+ self.periods = periods if periods is not None else [2, 3, 5, 7, 11]
+ self.discriminators = nn.ModuleList()
+ for period in self.periods:
+ self.discriminators.append(DiscriminatorP(period))
+
+ def forward(self, y, y_hat):
+ y_d_rs = []
+ y_d_gs = []
+ fmap_rs = []
+ fmap_gs = []
+ for i, d in enumerate(self.discriminators):
+ y_d_r, fmap_r = d(y)
+ y_d_g, fmap_g = d(y_hat)
+ y_d_rs.append(y_d_r)
+ fmap_rs.append(fmap_r)
+ y_d_gs.append(y_d_g)
+ fmap_gs.append(fmap_g)
+
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+class DiscriminatorS(torch.nn.Module):
+ def __init__(self, use_spectral_norm=False):
+ super(DiscriminatorS, self).__init__()
+        norm_f = spectral_norm if use_spectral_norm else weight_norm
+ self.convs = nn.ModuleList([
+ norm_f(Conv1d(1, 128, 15, 1, padding=7)),
+ norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
+ norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
+ norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
+ norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
+ ])
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
+
+ def forward(self, x):
+ fmap = []
+ for l in self.convs:
+ x = l(x)
+ x = F.leaky_relu(x, LRELU_SLOPE)
+ fmap.append(x)
+ x = self.conv_post(x)
+ fmap.append(x)
+ x = torch.flatten(x, 1, -1)
+
+ return x, fmap
+
+
+class MultiScaleDiscriminator(torch.nn.Module):
+ def __init__(self):
+ super(MultiScaleDiscriminator, self).__init__()
+ self.discriminators = nn.ModuleList([
+ DiscriminatorS(use_spectral_norm=True),
+ DiscriminatorS(),
+ DiscriminatorS(),
+ ])
+ self.meanpools = nn.ModuleList([
+ AvgPool1d(4, 2, padding=2),
+ AvgPool1d(4, 2, padding=2)
+ ])
+
+ def forward(self, y, y_hat):
+ y_d_rs = []
+ y_d_gs = []
+ fmap_rs = []
+ fmap_gs = []
+ for i, d in enumerate(self.discriminators):
+ if i != 0:
+ y = self.meanpools[i - 1](y)
+ y_hat = self.meanpools[i - 1](y_hat)
+ y_d_r, fmap_r = d(y)
+ y_d_g, fmap_g = d(y_hat)
+ y_d_rs.append(y_d_r)
+ fmap_rs.append(fmap_r)
+ y_d_gs.append(y_d_g)
+ fmap_gs.append(fmap_g)
+
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+def feature_loss(fmap_r, fmap_g):
+ loss = 0
+ for dr, dg in zip(fmap_r, fmap_g):
+ for rl, gl in zip(dr, dg):
+ loss += torch.mean(torch.abs(rl - gl))
+
+ return loss * 2
+
+
+def discriminator_loss(disc_real_outputs, disc_generated_outputs):
+ loss = 0
+ r_losses = []
+ g_losses = []
+ for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
+ r_loss = torch.mean((1 - dr) ** 2)
+ g_loss = torch.mean(dg ** 2)
+ loss += (r_loss + g_loss)
+ r_losses.append(r_loss.item())
+ g_losses.append(g_loss.item())
+
+ return loss, r_losses, g_losses
+
+
+def generator_loss(disc_outputs):
+ loss = 0
+ gen_losses = []
+ for dg in disc_outputs:
+ l = torch.mean((1 - dg) ** 2)
+ gen_losses.append(l)
+ loss += l
+
+ return loss, gen_losses
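Note on the `Generator` in this file: at each upsampling stage the transposed convolution output `x` is added to `noise_convs[i](har_source)`, so the two must have the same length. The harmonic source is produced at full sample rate (`frames * upp`) and strided down by the product of the remaining upsample rates. The sketch below checks that the lengths line up, using the standard output-length formulas in plain Python. The function names are illustrative, and the rates `[8, 8, 2, 2]` with kernel sizes `[16, 16, 4, 4]` are assumed values, not taken from a config in this PR. The check relies on each `k - u` and each remaining stride being even.

```python
from math import prod


def conv_transpose1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1-D transposed convolution (no dilation, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel_size


def conv1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1-D convolution (no dilation)."""
    return (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1


def stage_lengths(frames, upsample_rates, upsample_kernel_sizes):
    """Return (x_len, source_len) after each upsampling stage."""
    upp = prod(upsample_rates)
    source_len = frames * upp          # har_source is generated at sample rate
    x_len = frames
    pairs = []
    for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
        x_len = conv_transpose1d_out_len(x_len, k, u, (k - u) // 2)
        if i + 1 < len(upsample_rates):
            stride_f0 = prod(upsample_rates[i + 1:])
            src_len = conv1d_out_len(source_len, stride_f0 * 2, stride_f0, stride_f0 // 2)
        else:
            src_len = conv1d_out_len(source_len, 1, 1, 0)
        pairs.append((x_len, src_len))
    return pairs


pairs = stage_lengths(frames=10, upsample_rates=[8, 8, 2, 2],
                      upsample_kernel_sizes=[16, 16, 4, 4])
```

With an odd remaining stride the strided source convolution would come out one sample short under these formulas, so configs with odd upsample rates deserve a separate check.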
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/nvSTFT.py b/so-vits-svc/vdecoder/nsf_hifigan/nvSTFT.py
new file mode 100644
index 0000000000000000000000000000000000000000..62bd5a008f81929054f036c81955d5d73377f772
--- /dev/null
+++ b/so-vits-svc/vdecoder/nsf_hifigan/nvSTFT.py
@@ -0,0 +1,134 @@
+import math
+import os
+os.environ["LRU_CACHE_CAPACITY"] = "3"
+import random
+import torch
+import torch.utils.data
+import numpy as np
+import librosa
+from librosa.util import normalize
+from librosa.filters import mel as librosa_mel_fn
+from scipy.io.wavfile import read
+import soundfile as sf
+import torch.nn.functional as F
+
+def load_wav_to_torch(full_path, target_sr=None, return_empty_on_exception=False):
+ sampling_rate = None
+ try:
+        data, sampling_rate = sf.read(full_path, always_2d=True)  # always_2d=True returns shape (frames, channels), even for mono
+ except Exception as ex:
+ print(f"'{full_path}' failed to load.\nException:")
+ print(ex)
+ if return_empty_on_exception:
+ return [], sampling_rate or target_sr or 48000
+ else:
+            raise
+
+ if len(data.shape) > 1:
+ data = data[:, 0]
+ assert len(data) > 2# check duration of audio file is > 2 samples (because otherwise the slice operation was on the wrong dimension)
+
+ if np.issubdtype(data.dtype, np.integer): # if audio data is type int
+ max_mag = -np.iinfo(data.dtype).min # maximum magnitude = min possible value of intXX
+    else: # if audio data is floating point (sf.read returns float64 by default)
+ max_mag = max(np.amax(data), -np.amin(data))
+ max_mag = (2**31)+1 if max_mag > (2**15) else ((2**15)+1 if max_mag > 1.01 else 1.0) # data should be either 16-bit INT, 32-bit INT or [-1 to 1] float32
+
+ data = torch.FloatTensor(data.astype(np.float32))/max_mag
+
+ if (torch.isinf(data) | torch.isnan(data)).any() and return_empty_on_exception:# resample will crash with inf/NaN inputs. return_empty_on_exception will return empty arr instead of except
+ return [], sampling_rate or target_sr or 48000
+ if target_sr is not None and sampling_rate != target_sr:
+ data = torch.from_numpy(librosa.core.resample(data.numpy(), orig_sr=sampling_rate, target_sr=target_sr))
+ sampling_rate = target_sr
+
+ return data, sampling_rate
+
+def dynamic_range_compression(x, C=1, clip_val=1e-5):
+ return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
+
+def dynamic_range_decompression(x, C=1):
+ return np.exp(x) / C
+
+def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
+ return torch.log(torch.clamp(x, min=clip_val) * C)
+
+def dynamic_range_decompression_torch(x, C=1):
+ return torch.exp(x) / C
+
+class STFT():
+ def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop_length=256, fmin=20, fmax=11025, clip_val=1e-5):
+ self.target_sr = sr
+
+ self.n_mels = n_mels
+ self.n_fft = n_fft
+ self.win_size = win_size
+ self.hop_length = hop_length
+ self.fmin = fmin
+ self.fmax = fmax
+ self.clip_val = clip_val
+ self.mel_basis = {}
+ self.hann_window = {}
+
+ def get_mel(self, y, keyshift=0, speed=1, center=False):
+ sampling_rate = self.target_sr
+ n_mels = self.n_mels
+ n_fft = self.n_fft
+ win_size = self.win_size
+ hop_length = self.hop_length
+ fmin = self.fmin
+ fmax = self.fmax
+ clip_val = self.clip_val
+
+ factor = 2 ** (keyshift / 12)
+ n_fft_new = int(np.round(n_fft * factor))
+ win_size_new = int(np.round(win_size * factor))
+ hop_length_new = int(np.round(hop_length * speed))
+
+ if torch.min(y) < -1.:
+ print('min value is ', torch.min(y))
+ if torch.max(y) > 1.:
+ print('max value is ', torch.max(y))
+
+ mel_basis_key = str(fmax)+'_'+str(y.device)
+ if mel_basis_key not in self.mel_basis:
+ mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
+ self.mel_basis[mel_basis_key] = torch.from_numpy(mel).float().to(y.device)
+
+ keyshift_key = str(keyshift)+'_'+str(y.device)
+ if keyshift_key not in self.hann_window:
+ self.hann_window[keyshift_key] = torch.hann_window(win_size_new).to(y.device)
+
+ pad_left = (win_size_new - hop_length_new) //2
+ pad_right = max((win_size_new- hop_length_new + 1) //2, win_size_new - y.size(-1) - pad_left)
+ if pad_right < y.size(-1):
+ mode = 'reflect'
+ else:
+ mode = 'constant'
+ y = torch.nn.functional.pad(y.unsqueeze(1), (pad_left, pad_right), mode = mode)
+ y = y.squeeze(1)
+
+ spec = torch.stft(y, n_fft_new, hop_length=hop_length_new, win_length=win_size_new, window=self.hann_window[keyshift_key],
+ center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
+ # print(111,spec)
+ spec = torch.sqrt(spec.pow(2).sum(-1)+(1e-9))
+ if keyshift != 0:
+ size = n_fft // 2 + 1
+ resize = spec.size(1)
+ if resize < size:
+ spec = F.pad(spec, (0, 0, 0, size-resize))
+ spec = spec[:, :size, :] * win_size / win_size_new
+
+ # print(222,spec)
+ spec = torch.matmul(self.mel_basis[mel_basis_key], spec)
+ # print(333,spec)
+ spec = dynamic_range_compression_torch(spec, clip_val=clip_val)
+ # print(444,spec)
+ return spec
+
+ def __call__(self, audiopath):
+ audio, sr = load_wav_to_torch(audiopath, target_sr=self.target_sr)
+ spect = self.get_mel(audio.unsqueeze(0)).squeeze(0)
+ return spect
+
+stft = STFT()
diff --git a/so-vits-svc/vdecoder/nsf_hifigan/utils.py b/so-vits-svc/vdecoder/nsf_hifigan/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..84bff024f4d2e2de194b2a88ee7bbe5f0d33f67c
--- /dev/null
+++ b/so-vits-svc/vdecoder/nsf_hifigan/utils.py
@@ -0,0 +1,68 @@
+import glob
+import os
+import matplotlib
+import torch
+from torch.nn.utils import weight_norm
+matplotlib.use("Agg")
+import matplotlib.pylab as plt
+
+
+def plot_spectrogram(spectrogram):
+ fig, ax = plt.subplots(figsize=(10, 2))
+ im = ax.imshow(spectrogram, aspect="auto", origin="lower",
+ interpolation='none')
+ plt.colorbar(im, ax=ax)
+
+ fig.canvas.draw()
+ plt.close()
+
+ return fig
+
+
+def init_weights(m, mean=0.0, std=0.01):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ m.weight.data.normal_(mean, std)
+
+
+def apply_weight_norm(m):
+ classname = m.__class__.__name__
+ if classname.find("Conv") != -1:
+ weight_norm(m)
+
+
+def get_padding(kernel_size, dilation=1):
+ return int((kernel_size*dilation - dilation)/2)
+
+
+def load_checkpoint(filepath, device):
+ assert os.path.isfile(filepath)
+ print("Loading '{}'".format(filepath))
+ checkpoint_dict = torch.load(filepath, map_location=device)
+ print("Complete.")
+ return checkpoint_dict
+
+
+def save_checkpoint(filepath, obj):
+ print("Saving checkpoint to {}".format(filepath))
+ torch.save(obj, filepath)
+ print("Complete.")
+
+
+def del_old_checkpoints(cp_dir, prefix, n_models=2):
+ pattern = os.path.join(cp_dir, prefix + '????????')
+ cp_list = glob.glob(pattern) # get checkpoint paths
+ cp_list = sorted(cp_list)  # sort by iteration
+ if len(cp_list) > n_models:  # if more than n_models models are found
+ for cp in cp_list[:-n_models]:  # delete all but the latest n_models models
+ open(cp, 'w').close()  # empty the file contents
+ os.unlink(cp)  # delete the file (moved to trash when using Colab)
+
+
+def scan_checkpoint(cp_dir, prefix):
+ pattern = os.path.join(cp_dir, prefix + '????????')
+ cp_list = glob.glob(pattern)
+ if len(cp_list) == 0:
+ return None
+ return sorted(cp_list)[-1]
+
diff --git a/so-vits-svc/wav_upload.py b/so-vits-svc/wav_upload.py
new file mode 100644
index 0000000000000000000000000000000000000000..1a347fa9359edc21dcd9fe633579bc657c0a3fd4
--- /dev/null
+++ b/so-vits-svc/wav_upload.py
@@ -0,0 +1,21 @@
+from google.colab import files
+import shutil
+import os
+import argparse
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--type", type=str, required=True, help="type of file to upload")
+ args = parser.parse_args()
+ file_type = args.type
+
+ basepath = os.getcwd()
+ uploaded = files.upload()
+ assert(file_type in ['zip', 'audio'])
+ if file_type == "zip":
+ upload_path = "./upload/"
+ for filename in uploaded.keys():
+ shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, "userzip.zip"))
+ elif file_type == "audio":
+ upload_path = "./raw/"
+ for filename in uploaded.keys():
+ shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, filename))
\ No newline at end of file
diff --git a/so-vits-svc/webUI.py b/so-vits-svc/webUI.py
new file mode 100644
index 0000000000000000000000000000000000000000..037cd834b23c4439e853546cec3b25af458bcada
--- /dev/null
+++ b/so-vits-svc/webUI.py
@@ -0,0 +1,321 @@
+import io
+import os
+
+os.system("wget -nc -P hubert/ https://huggingface.co/spaces/innnky/nanami/resolve/main/checkpoint_best_legacy_500.pt")
+import gradio as gr
+import gradio.processing_utils as gr_pu
+import librosa
+import numpy as np
+import soundfile
+from inference.infer_tool import Svc
+import logging
+import re
+import json
+
+import subprocess
+import edge_tts
+import asyncio
+from scipy.io import wavfile
+import librosa
+import torch
+import time
+import traceback
+from itertools import chain
+from utils import mix_model
+import glob
+
+logging.getLogger('numba').setLevel(logging.WARNING)
+logging.getLogger('markdown_it').setLevel(logging.WARNING)
+logging.getLogger('urllib3').setLevel(logging.WARNING)
+logging.getLogger('matplotlib').setLevel(logging.WARNING)
+logging.getLogger('multipart').setLevel(logging.WARNING)
+
+model = None
+spk = None
+debug = False
+
+cuda = {}
+if torch.cuda.is_available():
+ for i in range(torch.cuda.device_count()):
+ device_name = torch.cuda.get_device_properties(i).name
+ cuda[f"CUDA:{i} {device_name}"] = f"cuda:{i}"
+
+# list available models
+models_list = glob.glob("../models/*/G_*.pth")
+cluster_models_list = glob.glob("../models/*/kmeans_*.pth")
+configs_list = glob.glob("../models/*/config.json")
+
+def upload_mix_append_file(files,sfiles):
+ try:
+ if sfiles is None:
+ file_paths = [file.name for file in files]
+ else:
+ file_paths = [file.name for file in chain(files,sfiles)]
+ p = {file:100 for file in file_paths}
+ return file_paths,mix_model_output1.update(value=json.dumps(p,indent=2))
+ except Exception as e:
+ if debug: traceback.print_exc()
+ raise gr.Error(e)
+
+def mix_submit_click(js,mode):
+ try:
+ assert js.lstrip()!=""
+ modes = {"凸组合":0, "线性组合":1}
+ mode = modes[mode]
+ data = json.loads(js)
+ data = list(data.items())
+ model_path,mix_rate = zip(*data)
+ path = mix_model(model_path,mix_rate,mode)
+ return f"Success: the file was saved to {path}"
+ except Exception as e:
+ if debug: traceback.print_exc()
+ raise gr.Error(e)
+
+def updata_mix_info(files):
+ try:
+ if files is None: return mix_model_output1.update(value="")
+ p = {file.name:100 for file in files}
+ return mix_model_output1.update(value=json.dumps(p,indent=2))
+ except Exception as e:
+ if debug: traceback.print_exc()
+ raise gr.Error(e)
+
+def modelAnalysis(model_path,config_path,cluster_model_path,device,enhance):
+ global model
+ try:
+ device = cuda[device] if "CUDA" in device else device
+ model = Svc(model_path, config_path, device=device if device!="Auto" else None, cluster_model_path = cluster_model_path if cluster_model_path is not None else "",nsf_hifigan_enhance=enhance)
+ spks = list(model.spk2id.keys())
+ device_name = torch.cuda.get_device_properties(model.dev).name if "cuda" in str(model.dev) else str(model.dev)
+ msg = f"Model loaded successfully on device {device_name}\n"
+ if cluster_model_path is None:
+ msg += "No cluster model loaded\n"
+ else:
+ msg += f"Cluster model {cluster_model_path} loaded successfully\n"
+ msg += "Available voices for the current model:\n"
+ for i in spks:
+ msg += i + " "
+ return sid.update(choices = spks,value=spks[0]), msg
+ except Exception as e:
+ if debug: traceback.print_exc()
+ raise gr.Error(e)
+
+
+def modelUnload():
+ global model
+ if model is None:
+ return sid.update(choices = [],value=""),"No model to unload!"
+ else:
+ model.unload_model()
+ model = None
+ torch.cuda.empty_cache()
+ return sid.update(choices = [],value=""),"Model unloaded!"
+
+
+def vc_fn(sid, input_audio, vc_transform, auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num,F0_mean_pooling,enhancer_adaptive_key,cr_threshold):
+ global model
+ try:
+ if input_audio is None:
+ raise gr.Error("You need to upload audio")
+ if model is None:
+ raise gr.Error("You need to specify a model")
+ sampling_rate, audio = input_audio
+ # print(audio.shape,sampling_rate)
+ audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32) if np.issubdtype(audio.dtype, np.integer) else audio.astype(np.float32)
+ if len(audio.shape) > 1:
+ audio = librosa.to_mono(audio.transpose(1, 0))
+ temp_path = "temp.wav"
+ soundfile.write(temp_path, audio, sampling_rate, format="wav")
+ _audio = model.slice_inference(temp_path, sid, vc_transform, slice_db, cluster_ratio, auto_f0, noise_scale,pad_seconds,cl_num,lg_num,lgr_num,F0_mean_pooling,enhancer_adaptive_key,cr_threshold)
+ model.clear_empty()
+ os.remove(temp_path)
+ # Build the output file path and save the result into the results folder
+ try:
+ timestamp = str(int(time.time()))
+ filename = sid + "_" + timestamp + ".wav"
+ output_file = os.path.join("./results", filename)
+ soundfile.write(output_file, _audio, model.target_sample, format="wav")
+ return f"Inference succeeded; audio saved as results/{filename}", (model.target_sample, _audio)
+ except Exception as e:
+ if debug: traceback.print_exc()
+ return "Failed to save the file; please save it manually", (model.target_sample, _audio)
+ except Exception as e:
+ if debug: traceback.print_exc()
+ raise gr.Error(e)
+
+
+def tts_func(_text,_rate,_voice):
+ # Use edge-tts to convert the text to audio
+ # voice = "zh-CN-XiaoyiNeural"  # female, higher pitch
+ # voice = "zh-CN-YunxiNeural"  # male
+ voice = "zh-CN-YunxiNeural"  # male
+ if ( _voice == "女" ) : voice = "zh-CN-XiaoyiNeural"
+ output_file = _text[0:10]+".wav"
+ # communicate = edge_tts.Communicate(_text, voice)
+ # await communicate.save(output_file)
+ if _rate>=0:
+ ratestr="+{:.0%}".format(_rate)
+ else:
+ ratestr="{:.0%}".format(_rate)  # the minus sign is already included
+
+ # Pass the arguments as a list (no shell) so that the text cannot inject shell commands
+ p=subprocess.Popen(["edge-tts",
+ "--text="+_text,
+ "--write-media", output_file,
+ "--voice", voice,
+ "--rate="+ratestr],
+ stdout=subprocess.PIPE,
+ stdin=subprocess.PIPE)
+ p.wait()
+ return output_file
+
+def text_clear(text):
+ return re.sub(r"[\n\,\(\) ]", "", text)
+
+def vc_fn2(sid, input_audio, vc_transform, auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num,text2tts,tts_rate,tts_voice,F0_mean_pooling,enhancer_adaptive_key,cr_threshold):
+ # Use edge-tts to convert the text to audio
+ text2tts=text_clear(text2tts)
+ output_file=tts_func(text2tts,tts_rate,tts_voice)
+
+ # Resample to the target sampling rate
+ sr2=44100
+ wav, sr = librosa.load(output_file)
+ wav2 = librosa.resample(wav, orig_sr=sr, target_sr=sr2)
+ save_path2= text2tts[0:10]+"_44k"+".wav"
+ wavfile.write(save_path2,sr2,
+ (wav2 * np.iinfo(np.int16).max).astype(np.int16)
+ )
+
+ # Read the audio back
+ sample_rate, data=gr_pu.audio_from_file(save_path2)
+ vc_input=(sample_rate, data)
+
+ a,b=vc_fn(sid, vc_input, vc_transform,auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num,F0_mean_pooling,enhancer_adaptive_key,cr_threshold)
+ os.remove(output_file)
+ os.remove(save_path2)
+ return a,b
+
+def debug_change():
+ global debug
+ debug = debug_button.value
+
+with gr.Blocks(
+ theme=gr.themes.Base(
+ primary_hue = gr.themes.colors.green,
+ font=["Source Sans Pro", "Arial", "sans-serif"],
+ font_mono=['JetBrains mono', "Consolas", 'Courier New']
+ ),
+) as app:
+ with gr.Tabs():
+ with gr.TabItem("Inference"):
+ gr.Markdown(value="""
+ So-vits-svc 4.0 inference webui
+ """)
+
+ with gr.Row(variant="panel"):
+ with gr.Column():
+ gr.Markdown(value="""
+ Model settings
+ """)
+ model_path = gr.Dropdown(label="Available models", choices=models_list)
+ config_path = gr.Dropdown(label="Available config files", choices=configs_list)
+ cluster_model_path = gr.Dropdown(label="Cluster model file (optional)", choices=cluster_models_list, value=None)
+ device = gr.Dropdown(label="Inference device; by default chosen automatically between CPU and GPU", choices=["Auto",*cuda.keys(),"CPU"], value="Auto")
+ enhance = gr.Checkbox(label="Use NSF_HIFIGAN enhancement. This can improve audio quality for some models trained on small datasets, but degrades well-trained models. Off by default", value=False)
+ with gr.Column():
+ gr.Markdown(value="""
+ After selecting all files on the left (every file widget shows download), click “Load model” to parse them:
+ """)
+ model_load_button = gr.Button(value="Load model", variant="primary")
+ model_unload_button = gr.Button(value="Unload model", variant="primary")
+ sid = gr.Dropdown(label="Voice (speaker)")
+ sid_output = gr.Textbox(label="Output Message")
+
+
+ with gr.Row(variant="panel"):
+ with gr.Column():
+ gr.Markdown(value="""
+ Inference settings
+ """)
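+ auto_f0 = gr.Checkbox(label="Automatic f0 prediction. Works better together with a cluster model, but disables pitch shifting (for speech conversion only; enabling it for singing will put the output badly out of tune)", value=False)
+ F0_mean_pooling = gr.Checkbox(label="Apply a mean filter (pooling) to F0, which can improve some muted sounds. Note: enabling this slows down inference. Off by default", value=False)
+ vc_transform = gr.Number(label="Pitch shift (integer, positive or negative, in semitones; 12 raises the pitch by one octave)", value=0)
+ cluster_ratio = gr.Number(label="Cluster model mixing ratio, between 0 and 1; 0 disables clustering. A cluster model improves timbre similarity but reduces articulation clarity (about 0.5 is suggested if used)", value=0)
+ slice_db = gr.Number(label="Slicing threshold", value=-40)
+ noise_scale = gr.Number(label="noise_scale: best left unchanged; it affects audio quality in ways that are hard to predict", value=0.4)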
+ with gr.Column():
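+ pad_seconds = gr.Number(label="Seconds of padding for the inference audio. For unknown reasons there are artifacts at the start and end; padding with a short silence avoids them", value=0.5)
+ cl_num = gr.Number(label="Automatic audio slicing; 0 disables slicing; unit: seconds (s)", value=0)
+ lg_num = gr.Number(label="Crossfade length between adjacent audio slices. Adjust it if vocals sound discontinuous after automatic slicing; if they are continuous, the default 0 is recommended. Note: this setting affects inference speed. Unit: seconds (s)", value=0)
+ lgr_num = gr.Number(label="After automatic slicing, the head and tail of each slice are discarded. This sets the proportion of the crossfade length that is kept; range (0, 1]", value=0.75)
+ enhancer_adaptive_key = gr.Number(label="Adapt the enhancer to a higher pitch range (in semitones) | default 0", value=0)
+ cr_threshold = gr.Number(label="F0 filtering threshold; only effective when f0_mean_pooling is enabled. Range 0-1. Lowering it reduces the chance of going out of tune but increases muted sounds", value=0.05)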
+ with gr.Tabs():
+ with gr.TabItem("Audio to audio"):
+ vc_input3 = gr.Audio(label="Select audio")
+ vc_submit = gr.Button("Convert audio", variant="primary")
+ with gr.TabItem("Text to audio"):
+ text2tts=gr.Textbox(label="Enter the text to convert here. Note: enabling F0 prediction is recommended with this feature, otherwise the result will sound odd")
+ tts_rate = gr.Number(label="TTS speaking rate", value=0)
+ tts_voice = gr.Radio(label="Gender (男 = male, 女 = female)",choices=["男","女"], value="男")
+ vc_submit2 = gr.Button("Convert text", variant="primary")
+ with gr.Row():
+ with gr.Column():
+ vc_output1 = gr.Textbox(label="Output Message")
+ with gr.Column():
+ vc_output2 = gr.Audio(label="Output Audio", interactive=False)
+
+ with gr.TabItem("Tools / experimental features"):
+ gr.Markdown(value="""
+ So-vits-svc 4.0 tools / experimental features
+ """)
+ with gr.Tabs():
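+ with gr.TabItem("Static voice mixing"):
+ gr.Markdown(value="""
+ Introduction: this feature merges several voice models into one (a convex or linear combination of the models' parameters), producing a voice that does not exist in reality
+ Notes:
+ 1. This feature only supports single-speaker models
+ 2. If you use multi-speaker models anyway, make sure all models have the same number of speakers, so that voices under the same SpeakerID can be mixed
+ 3. Make sure the model field in config.json is identical for all models being mixed
+ 4. The mixed model can use the config.json of any of the source models, but cluster models cannot be used with it
+ 5. When uploading models in bulk, it is best to put them in one folder, select them all, and upload them together
+ 6. Mixing ratios between 0 and 100 are recommended; other values are allowed but have unknown effects in linear combination mode
+ 7. After mixing, the file is saved in the project root directory as output.pth
+ 8. Convex combination mode (凸组合) applies Softmax to the mixing ratios so that they sum to 1; linear combination mode (线性组合) does not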
+
+ """)
+ mix_model_path = gr.Files(label="Select the model files to mix")
+ mix_model_upload_button = gr.UploadButton("Select/append model files to mix", file_count="multiple", variant="primary")
+ mix_model_output1 = gr.Textbox(
+ label="Mixing ratio adjustment, unit: %",
+ interactive = True
+ )
+ mix_mode = gr.Radio(choices=["凸组合", "线性组合"], label="Mixing mode (凸组合 = convex combination, 线性组合 = linear combination)",value="凸组合",interactive = True)
+ mix_submit = gr.Button("Start voice mixing", variant="primary")
+ mix_model_output2 = gr.Textbox(
+ label="Output Message"
+ )
+ mix_model_path.change(updata_mix_info,[mix_model_path],[mix_model_output1])
+ mix_model_upload_button.upload(upload_mix_append_file, [mix_model_upload_button,mix_model_path], [mix_model_path,mix_model_output1])
+ mix_submit.click(mix_submit_click, [mix_model_output1,mix_mode], [mix_model_output2])
+
+
+ with gr.Tabs():
+ with gr.Row(variant="panel"):
+ with gr.Column():
+ gr.Markdown(value="""
+ WebUI settings
+ """)
+ debug_button = gr.Checkbox(label="Debug mode: enable it when reporting bugs to the community; once enabled, the console shows detailed error messages", value=debug)
+ vc_submit.click(vc_fn, [sid, vc_input3, vc_transform,auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num,F0_mean_pooling,enhancer_adaptive_key,cr_threshold], [vc_output1, vc_output2])
+ vc_submit2.click(vc_fn2, [sid, vc_input3, vc_transform,auto_f0,cluster_ratio, slice_db, noise_scale,pad_seconds,cl_num,lg_num,lgr_num,text2tts,tts_rate,tts_voice,F0_mean_pooling,enhancer_adaptive_key,cr_threshold], [vc_output1, vc_output2])
+ debug_button.change(debug_change,[],[])
+ model_load_button.click(modelAnalysis,[model_path,config_path,cluster_model_path,device,enhance],[sid,sid_output])
+ model_unload_button.click(modelUnload,[],[sid,sid_output])
+ app.launch(
+ server_name=os.environ.get("SERVER_NAME", "0.0.0.0"),
+ server_port=int(os.environ.get("SERVER_PORT", 7860))
+ )
+
+
+