Question about this model's training dataset
Hi friends, the model has done me a great favor! But there's something I'm curious about that I'd like to ask you about.
May I ask what your model's training set consists of? When I use your model for inference on its own, the resulting voice sounds like a mix of many human voices; it isn't very clear and has noticeable noise. However, the voice I trained and fitted on top of this base model performs excellently. Why is that? Was your model trained on many voices? If so, is that common practice for training a base model, and what are the benefits compared to using data from a single speaker?
That's the question that's been bothering me lately, and I would really appreciate it if you could reply.
Hello, this pre-trained model originally came from the so-vits-svc community and was trained on about 300 hours of data. On that basis, I fine-tuned it using the VCTK training set and a Paimon training set to get the final model. I'm not sure about the model's original training material, but it is certain that the pre-trained model was trained on a large amount of male and female voice data covering the common vocal ranges of both genders.
The pre-trained model has had its optimizer state removed, retaining only the parts useful for training, so you cannot load it directly for inference. Even if you could, as you found, the results wouldn't be very good.
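If you're curious, you can see this for yourself by inspecting a checkpoint. Here is a minimal sketch, assuming a standard PyTorch .pth file (the filename is a placeholder):

import torch

# Load the checkpoint on the CPU and list its top-level entries
# (the filename here is hypothetical; use your own .pth file).
ckpt = torch.load("G_0.pth", map_location="cpu")
print(list(ckpt.keys()))  # typically: model, iteration, optimizer, learning_rate
print("optimizer stripped:", ckpt.get("optimizer") is None)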
There are generally two methods for producing a pre-trained (base) model:
- Not using the original pre-trained model, i.e., retraining a base model from scratch. You need to prepare at least 100 hours of high-quality male and female audio covering a wide vocal range and train a multi-speaker model, typically for more than 200k steps. After training, use the code below to remove the unnecessary model layers.
- Fine-tuning the original pre-trained model to produce a specialized base model. This method is suitable for a large amount of low-quality single-speaker data. You need to prepare more than 30 hours of single-speaker audio as a training set and fine-tune it with the original pre-trained model loaded, training for about 200k steps. After training, use the code below to remove the unnecessary model layers. Once you have this specialized base model, select higher-quality training audio of the same speaker and continue training; at that point, only a few thousand steps are needed.
The first method produces a general base model that can be used widely; the second produces a specialized base model, which generally only works for the specific speaker it was specialized for.
Using a model trained on a large amount of multi-speaker audio as a base lets fine-tuning fit faster, reaching the target quality in fewer training steps (though it may also cause overfitting).
Here is the code to remove the unnecessary layers and convert a trained checkpoint into a base model:
import torch

# Generator and discriminator checkpoints to clean.
G = "G_197600.pth"
D = "D_197600.pth"

# Clean the generator: reset the step counter, drop the optimizer
# state, and delete the speaker-embedding layer so the model can be
# fine-tuned on new speakers.
a = torch.load(G)
a["iteration"] = 0
a["optimizer"] = None
a["learning_rate"] = 0.0001
del a["model"]["emb_g.weight"]
torch.save(a, f"clean_{G}")

# Clean the discriminator the same way (it has no speaker embedding).
a = torch.load(D)
a["iteration"] = 0
a["optimizer"] = None
a["learning_rate"] = 0.0001
torch.save(a, f"clean_{D}")
You need to manually modify the model names above.
If you have other questions, feel free to ask me.
Thank you very much for your reply! Your response not only tells me the recommended data, methods, and steps for training a base model from scratch, but also how best to fine-tune the individual voice I want on top of a base model. It even provides the code to remove the unnecessary layers and convert a checkpoint into a base model. This is so helpful to me! My heartfelt thanks for your timely reply.
Hello friends, a small question about the dataset you mentioned is bothering me; I would appreciate it if you could help me!
Actually, I am using the second method to train a voice model. As you recommended, "This method is suitable for a large amount of low-quality single-speaker data. You need to prepare more than 30 hours of single-speaker audio as a training set and fine-tune it with the original pre-trained model loaded, training for about 200k steps." But now I have very high-quality single-speaker data: it is very clean, absolutely pure human voice, but I only have about 3 hours of it. In this case, should I still train for 200k steps, or change the number of steps? Do I need to expand the dataset? If so, is there an algorithm I can use to augment the existing data, or do I need to collect more raw recordings of the speaker's voice?
I'm sorry if my questions are a bit much, but I really hope to receive your reply. Sincerely, thank you very much!
Hi, if you only have three hours of data, it is not enough for base-model training or specialized base-model training. I suggest you load the community pre-trained model and train on it directly. Normal training for only tens of thousands of steps can achieve very good results, and since your data is very clean, the outcome will naturally not be bad.
There is no algorithm that can genuinely augment the data; the only option is to find more recordings yourself. Even copying and pasting the existing audio won't have any effect.
In summary, if you can find more data (30 hours or more), you can try training a specialized base model first and then continue training. If you only have three hours of data, just load the community base model directly for training.
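If you want to check exactly how much usable audio you have before deciding, here is a minimal sketch that sums the durations of your clips (the dataset path is a placeholder, and it assumes the soundfile package is installed):

import glob
import soundfile as sf  # pip install soundfile

# Sum the duration of every wav file in the dataset folder
# (the path is a placeholder; point it at your own speaker folder).
paths = glob.glob("dataset_raw/my_speaker/*.wav")
total_sec = sum(sf.info(p).duration for p in paths)
print(f"{len(paths)} clips, {total_sec / 3600:.2f} hours of audio")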
For more information about training, you can read the official README.md. If you need more detailed information, you can read the document I wrote. Here is the link: https://github.com/SUC-DriverOld/so-vits-svc-Deployment-Documents
Thank you so much! Your document is very clear and gives me easy access to all the details; it's so helpful!
While training over the past few days, I ran into a very confusing problem, and none of the solutions I found online worked.
Actually, I am using whisper-ppg-large-v2 to train my model, but the diffusion model can't be used even though I trained it successfully. I followed the tutorial and trained it (at the beginning the training stage failed; now the model trains well, but inference fails), and it always shows the error below.
Hi friend, have you ever run into this problem? I would really appreciate it if you could reply. Thank you again!
Hello friend,
I'm very sorry, but I have never used whisper-ppg-large-v2 to train my model. I can't provide a solution to your error. I went through the usage guide document of the integrated package made by the Chinese expert "Yumao Butuan." It only mentions that whisper-ppg supports diffusion models, but requires both the main model and the diffusion model to use the same encoder. I didn't find any other explanations. Perhaps you can try using the integrated package he made, which includes all the commonly used encoders and training base models, as well as all the necessary dependencies. Here is the Google Drive link: https://drive.google.com/file/d/19e7HJYk32WHVIXm-CuhvbwF8IHjZ48QU/view?usp=sharing. And here is the usage guide for the integrated package: https://www.yuque.com/umoubuton/ueupp5. However, it is written in Chinese.
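One thing you could check before switching tools: since the guide says the main model and the diffusion model must use the same encoder, you can verify that your two configs agree. This is a minimal sketch; the file paths and key names ("model"/"speech_encoder" in config.json and "data"/"encoder" in diffusion.yaml) are assumptions based on typical so-vits-svc 4.1 configs, so adjust them to your setup:

import json
import yaml  # pip install pyyaml

# Read the speech encoder used by the main model and by the diffusion
# model (paths and key names assumed; verify against your own configs).
with open("configs/config.json") as f:
    main_enc = json.load(f)["model"]["speech_encoder"]
with open("configs/diffusion.yaml") as f:
    diff_enc = yaml.safe_load(f)["data"]["encoder"]

print("main model encoder:", main_enc)
print("diffusion model encoder:", diff_enc)
if main_enc != diff_enc:
    print("Encoders differ: retrain or re-preprocess so they match.")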
I'm very sorry that I couldn't solve your problem.