hexgrad's activity
Feedback appreciated, both positive and negative. Non-English languages haven't been validated by the model creator(s), so if you're a native speaker, criticize away!
Kokoro TTS can now speak Chinese, Korean, and French, in addition to English and Japanese.
WAV converted to MP4 using FFmpeg, since audio attachments aren't allowed in Posts. You may have to unmute the video.
The voice quality actually sounds close to ElevenLabs.
I might've mentioned this elsewhere, but if you plug Kokoro outputs for named ElevenLabs voices into https://elevenlabs.io/ai-speech-classifier you should get very reliable positives (98% confidence that the audio was generated by ElevenLabs).
By ear, I think Kokoro is indeed close to ElevenLabs, especially on certain voices. For Nicole, they are indistinguishable to me. Michael is pretty close; Adam is still somewhat weak.
But StyleTTS is usually not very emotional.
I agree. Kokoro also has 2 specific issues in this area: (1) little to no emotional audio seen during training, and (2) even if there were, the stock voices are style vectors averaged over 10-100 samples, which produces an average/neutral style anyway.
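As a toy illustration of point (2): averaging style vectors is just a mean over per-sample embeddings, so whatever emotion individual clips carry gets smoothed out. Everything below (names, dimensions) is hypothetical, not Kokoro's actual code:

import numpy as np

# Hypothetical illustration only; the dimension is made up.
STYLE_DIM = 256

def make_stock_voice(per_sample_styles: np.ndarray) -> np.ndarray:
    # Mean over N per-sample style vectors, shape [N, STYLE_DIM].
    # Averaging washes out clip-to-clip emotional variation, which is
    # why the resulting stock voice sounds neutral.
    return per_sample_styles.mean(axis=0)

# e.g. 50 reference clips -> one averaged, fairly neutral voice vector
styles = np.random.randn(50, STYLE_DIM).astype(np.float32)
voice = make_stock_voice(styles)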
self.brag():
Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago. Even allowing for the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top-3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
I used ffmpeg to make the video:
ffmpeg -i input.wav -r 25 -filter_complex "[0:a]compand,showwaves=size=400x400:colors=#ffd700:draw=full:mode=line,format=yuv420p[vout]" -map "[vout]" -map 0:a -c:v libx264 -c:a aac output.mp4
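For anyone adapting the command: compand evens out the audio's dynamic range, showwaves renders the waveform itself (a 400x400 canvas, gold line, with draw=full drawing every sample), and format=yuv420p keeps the H.264 output playable in most players; the two -map flags then pair the rendered video stream with the untouched original audio.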
It's expressive, punches way above its weight class, and supports voice cloning. Go check it out!
(Unmute the audio sample below after hitting play)
What tool are you using to generate that video?
No voice cloning yet, but an 80M model I trained makes this:
If the voice sounds familiar, it is, and the classifier seems to agree.
At 500M parameters, it's efficient enough to run on basic hardware but powerful enough for professional use.
This could transform how we produce audio content for news: think instant translated interviews that keep the original voices, or scaled-up audio article production!
Demo and Model on the Hub: OuteAI/OuteTTS-0.2-500M h/t @reach-vb
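If you want to try it locally, here is a minimal sketch based on the outetts package's README at the time; treat the exact class names and arguments as a snapshot that may have changed since:

import outetts

# Snapshot of the OuteTTS 0.2 README usage; may have drifted since.
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
)
output.save("output.wav")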
This is conjecture, but it's possible the voice sample for XTTS is in-distribution, i.e. seen during training, and if so you'd expect it to perform better than F5 given the same reference. No knock on XTTS, btw; Kokoro is equally guilty of this: the voice used in the Arena is also in-distribution.
It would not be surprising to me if voice cloning is simply "looking up" the most similar training speaker, or an interpolation of training speakers. François Chollet has discussed this phenomenon many times wrt LLMs, and I highly recommend listening to his talks.
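To make the "lookup" hypothesis concrete, here is a toy sketch: embed the reference, then return a similarity-weighted blend of the nearest training speakers. Every name and dimension here is hypothetical; it illustrates the hypothesis, not any model's actual code:

import numpy as np

def cosine_sim(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and a bank of vectors.
    return (bank @ query) / (
        np.linalg.norm(bank, axis=1) * np.linalg.norm(query) + 1e-8
    )

def clone_as_lookup(ref_embedding: np.ndarray,
                    speaker_bank: np.ndarray,
                    k: int = 3) -> np.ndarray:
    # "Cloning as retrieval": blend the k training speakers nearest to
    # the reference, weighted by similarity. If the reference is
    # in-distribution, its own speaker dominates the blend.
    sims = cosine_sim(ref_embedding, speaker_bank)
    top = np.argsort(sims)[-k:]
    weights = sims[top] / sims[top].sum()
    return (weights[:, None] * speaker_bank[top]).sum(axis=0)

# Toy data: 1000 training-speaker embeddings, 192-dim (both arbitrary).
bank = np.random.randn(1000, 192).astype(np.float32)
ref = bank[42] + 0.05 * np.random.randn(192)
cloned_style = clone_as_lookup(ref, bank)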
https://hf.co/spaces/hexgrad/Kokoro-TTS/discussions/3#6744bdea8c689a7071742134
Read more and listen to before/after audio samples at https://hf.co/blog/hexgrad/kokoro-short-burst-upgrade
(Probably would have made that Article a Post instead, if audio could be embedded into Posts.)