Thank you :)
I have no clue how you trained this, but this is by far the best and most realistic RVCv2 model I've seen.
Seriously. It's one of the very few ones that actually sound natural and authentic with good audio quality and minimal artifacts, and works for both speaking and singing in realtime (I'm using w-okada's RVCC).
It's trained with a semi-clean dataset. I found out that training a model with a dataset that's not super clean gives me more realistic models than a super clean one
e.g. the Nekrolina model too
Semi-clean as in everything is consistent and filtered the same but just lightly? Or as in both clean takes and intentionally leaving a few unfiltered takes?
Any special preprocessing voodoo or just the usual UVR5 stuff and cutting into chunks?
Thanks for the hint btw, the Nekrolina one turned out great, too!
Your TsunamiCat model still has a higher dynamic range, though, so it performs even better on quieter parts. Like when whispering or talking very quietly, the Tsunami one goes into a sexy vocal fry which really helps with overall authenticity. Most models just add unnatural background noise or noise gate artifacts in that range.
I basically just make what I call scuffed models: I run the original dataset through a background remover like UVR using Vocal FT, then manually remove stuff like TTS
As for background noise, I intentionally ignore it (keep it in, since I'm lazy)