Can you look into this TTS Engine Fishaudio?
http://github.com/fishaudio/fish-speech
FishSpeech v1.5 - multilingual, zero-shot instant voice cloning, low-latency, only 500M params - #2 ranked on TTS-Arena
Highlights:
- #2 ranked on TTS-Arena (as "Anonymous Sparkle")
- 1M hours of multilingual training data
- 13 languages supported, including English, Chinese, Japanese & more
- <150ms latency with high-quality instant voice cloning
- Pretrained model now open source
- Cost-effective self-hosting or cloud options
Playground: http://fish.audio/
Code: http://github.com/fishaudio/fish-speech
Demo: http://huggingface.co/spaces/fishaudio/fish-speech-1
Rank: http://huggingface.co/spaces/TTS-AGI/TTS-Arena
Thanks for sharing this. Unfortunately, this won't work well with the very limited amount of TTS data we have. It might be possible using YouTube data, but that is not feasible for me right now. I am currently researching Speech LLMs, which work similarly to the model above and will be able to do TTS too. For your specific request, though, I don't have time right now. Sorry.