A small experiment I did with a subset of the moespeech JP dataset. The two models (GPT and SoVITS) are designed to run with GPT-SoVITS.

I used 6 hours of audio for training. The selected audio samples were categorized into frequency bands (100–500 Hz) at 50 Hz intervals, and each band received equal representation in the final dataset so the model learns from a diverse range of voice frequencies. Samples outside the 3–10 second range were discarded due to GPT-SoVITS limitations.
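The exact bucketing pipeline isn't published here, but the idea is easy to reproduce. Below is a minimal sketch, assuming librosa's `pyin` pitch tracker and using the median voiced F0 to assign each clip to a band; `pitch_band` is a hypothetical helper, not part of the released code:

```python
import numpy as np
import librosa

MIN_SEC, MAX_SEC = 3.0, 10.0  # GPT-SoVITS only accepts 3-10 s clips

def pitch_band(path):
    """Return the lower edge of a clip's 50 Hz pitch band, or None to discard it."""
    y, sr = librosa.load(path, sr=None, mono=True)
    if not MIN_SEC <= librosa.get_duration(y=y, sr=sr) <= MAX_SEC:
        return None                              # outside the 3-10 s window
    # pyin yields NaN for unvoiced frames, so the median tracks voiced pitch only
    f0, _, _ = librosa.pyin(y, fmin=80, fmax=600, sr=sr)
    med = np.nanmedian(f0)
    if np.isnan(med) or not 100 <= med < 500:
        return None                              # no voiced speech, or out of range
    return 100 + int((med - 100) // 50) * 50     # e.g. 237 Hz -> 200 Hz band
```

Equal representation can then be enforced by downsampling every band to the size of the smallest one.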

The model is proficient in Japanese only and tends to produce a slightly higher pitch than the reference audio; this is mitigated by using a low temperature (0.3) at inference. Compared to the GPT-SoVITS base model, the inflections are much more natural, including laughing, sighing, and other nuances.
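For reference, this is roughly how the low temperature would be passed at inference time. A hedged sketch assuming the `api_v2.py` HTTP server bundled with GPT-SoVITS on its default port; endpoint and field names may differ across versions, and the file paths are placeholders:

```python
import requests

# Assumes GPT-SoVITS's api_v2.py is running locally with this model's
# GPT/SoVITS weights loaded; field names may vary between versions.
resp = requests.post(
    "http://127.0.0.1:9880/tts",
    json={
        "text": "こんにちは、今日はいい天気ですね。",  # Japanese only
        "text_lang": "ja",
        "ref_audio_path": "ref.wav",   # 3-10 s reference clip
        "prompt_text": "...",          # transcript of ref.wav
        "prompt_lang": "ja",
        "temperature": 0.3,            # low value curbs the pitch drift
    },
    timeout=120,
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)
```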

The license is cc-by-nc-nd-4.0.
