Christoph Minixhofer

cdminix


AI & ML interests

None yet

Recent Activity

updated a dataset 29 days ago

ttsds/requests

liked a model 3 months ago

marksverdhei/Qwen3-Voice-Embedding-12Hz-1.7B

liked a Space 3 months ago

DynamicSuperb/leaderboard

posted an update over 1 year ago

As part of some ongoing work, I'm releasing what is currently the largest collection of Docker containers for state-of-the-art voice cloning TTS systems.
https://github.com/ttsds/datasets

Alongside it, there is also a nice overview of all the systems.

replied to their post almost 2 years ago

Totally agree! Tortoise doesn't seem to get benchmarked or compared against as much as other systems, and I don't know exactly why.

Not just for Tortoise, but for all these systems, it would be interesting to see how they compare to each other when finetuned. Unfortunately, I don't know of any benchmarks or papers that have tried to evaluate that (yet).

posted an update almost 2 years ago

I just added 5 more models to my open source TTS model benchmark, ttsds/benchmark.
Let's talk about the results!

Over the last couple of days, I added jbetker/tortoise-tts-v2, metavoiceio/metavoice-1B-v0.1, audo/HierSpeechpp, and the unofficial implementations of amphion/NaturalSpeech2 and amphion/valle by amphion.

Takeaways:
- TorToiSe does very well, falling into second place after StyleTTS 2, which is also ranked first in the human evaluation at TTS-AGI/TTS-Arena.
- MetaVoice-1B's overall score is dragged down by its Intelligibility Score (probably due to utterances being cut short), but it achieves #3 in Speaker Score, which indicates good voice cloning ability.
- HierSpeech++ lands in the middle of the road in terms of overall performance, but excels at the Environment Score, achieving #2. This means the model is especially good at modeling recording conditions such as microphone and background noise.
- The Amphion models achieve relatively low scores, possibly because they were not trained for as long as in the papers. However, they seem to struggle for different reasons: the autoregressive VALLE models have low Intelligibility Scores (possibly due to "babbling" or early stop tokens), while NaturalSpeech2 has low Speaker and Prosody scores.

What's next?
I'm planning to add more open source TTS models like suno/bark, CAMB-AI/MARS5-TTS and fishaudio/fish-speech-1.2. I'll also write an article on these and all the other results soon, since our paper, TTSDS -- Text-to-Speech Distribution Score (2407.12707), mostly focused on establishing the benchmark itself rather than the individual TTS systems.


posted an update about 2 years ago

Since new TTS (Text-to-Speech) systems are coming out what feels like every day, and it's currently hard to compare them, my latest project has focused on doing just that.

I was inspired by the TTS-AGI/TTS-Arena (definitely check it out if you haven't), which compares recent TTS systems using crowdsourced A/B testing.

I wanted to see if we could do a similar evaluation with objective metrics, and it's now available here:
ttsds/benchmark
Anyone can submit a new TTS model, and I hope this provides a way to see in which areas models perform well or poorly.

The paper with all the details is available here: https://arxiv.org/abs/2407.12707
