How to fine-tune on a new language?
I want to fine-tune on a Turkish voice.
Yes, that. Also, I have not yet posted recipes for continuing training from the checkpoint uploaded to this repo. I might need to update the Philosophy to include something along the lines of:
Currently, Kokoro is packaged & delivered to you as an end product meant to be used & deployed.
That could change later, but no promises.
However, it should be fairly transparent that Kokoro uses a StyleTTS 2 architecture, which is FOSS/MIT, so rolling your own model is always an option. Multilingual STTS2 models can be trained, and have been. A big hurdle is finding a good g2p solution for your language. I only speak English, so this is a very tight bottleneck, along with data sourcing.
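To give a feel for what "a g2p solution" means here, below is a naive rule-based sketch for Turkish. It assumes Turkish spelling is close enough to one-letter-one-sound that a lookup table yields a rough IPA string. The `g2p` function and the `TURKISH_TO_IPA` table are hypothetical illustrations, not part of Kokoro or StyleTTS 2. A real setup would more likely use a dedicated tool such as espeak-ng, and the phoneme symbols must match whatever inventory the model is trained on.

```python
import unicodedata

# Hypothetical, approximate letter-to-IPA table for Turkish.
# Ignores stress, palatalized k/g/l before front vowels, loanword
# exceptions, numbers, and abbreviations.
TURKISH_TO_IPA = {
    "a": "a", "b": "b", "c": "dʒ", "ç": "tʃ", "d": "d", "e": "e",
    "f": "f", "g": "ɡ", "ğ": "ː",  # soft g roughly treated as vowel lengthening
    "h": "h", "ı": "ɯ", "i": "i", "j": "ʒ", "k": "k", "l": "l",
    "m": "m", "n": "n", "o": "o", "ö": "ø", "p": "p", "r": "ɾ",
    "s": "s", "ş": "ʃ", "t": "t", "u": "u", "ü": "y", "v": "v",
    "y": "j", "z": "z",
}


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish dotted/dotless I rules (I -> ı, İ -> i)."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("I", "ı").replace("İ", "i").lower()


def g2p(text: str) -> str:
    """Convert Turkish text to a rough IPA string, one token per word.

    Characters not in the table (punctuation, digits) are dropped.
    """
    words = []
    for word in turkish_lower(text).split():
        phones = [TURKISH_TO_IPA[ch] for ch in word if ch in TURKISH_TO_IPA]
        if phones:
            words.append("".join(phones))
    return " ".join(words)


if __name__ == "__main__":
    print(g2p("Merhaba dünya"))  # meɾhaba dynja
    print(g2p("İstanbul"))       # istanbul
```

Even this toy version shows why g2p is language-specific work: Python's default `str.lower()` would map "I" to "i", which is wrong for Turkish, and rules like the soft g (ğ) do not reduce cleanly to a single symbol.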
For StyleTTS2:
- @Respair has trained a Japanese model, Tsukasa, and also hosted a demo Space
- @patriotyk has trained a Ukrainian model: https://hf.co/spaces/patriotyk/styletts2-ukrainian
- Although the above 2 are dedicated monolingual models, I do not think there is a hard constraint preventing you from stuffing many languages into 1 model. In the Kokoro TTS Space, v0.23 shares 82M params across 5 languages: English, French, Japanese, Korean, Chinese.
Edit: other STTS2 models I'm aware of:
- I believe Respair has also done Persian
- Someone else (can't find their username on HF right now) has done Korean
- Another person has done 5-way multilingual: English, German, French, Italian and Spanish
Without gatekeeping, I should warn you that training models (especially STTS) is not for the faint of heart, and producing good outcomes could take substantial compute, time, and experience, all of which are obtainable. I believe XTTS v2 might support Turkish out of the box, but I have not tried it.