Support evaluation of other languages?

#10
by laubonghaudoi

Currently all test cases are in English, so models are evaluated only on their English performance. This misses the multilingual abilities of many TTS models; XTTS, for example, supports 16 languages. Multilingual ability is an important dimension of TTS models, and we can't assume that a model's English performance transfers to other languages. It would be very valuable if we could evaluate performance in non-English languages.

I plan to add this capability. I'm just not sure which sentences to use.

You could go with the ones from Common Voice. All of those have been validated by the public.
https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/tree/main/transcript

Thank you! Common Voice is definitely a very good choice. All of its text sentences are in the public domain, and the sentences are validated by volunteers, so we can sample test cases from the validated portion. We just need to be aware that some sentences might not be normalized (they contain special symbols or unreadable words) and some are too short or too long (a single word, or hundreds of words). As long as we filter out those dirty samples, the remaining ones should make very good TTS test cases.
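For example, here's a minimal sketch of sampling and filtering the validated sentences. The transcript file path and the exact thresholds are my assumptions, not a final recipe, and you need to have accepted the dataset's terms on the Hub:

```python
# Minimal sketch: sample clean TTS test sentences from Common Voice's
# validated transcripts. The transcript path and the filter thresholds
# below are illustrative assumptions.
import re

import pandas as pd
from huggingface_hub import hf_hub_download

# Download one language's validated transcript TSV (here: English).
tsv_path = hf_hub_download(
    repo_id="mozilla-foundation/common_voice_17_0",
    filename="transcript/en/validated.tsv",  # assumed transcript-dir layout
    repo_type="dataset",
)
df = pd.read_csv(tsv_path, sep="\t", quoting=3)  # quoting=3: ignore quote chars

def is_clean(sentence: str) -> bool:
    """Heuristic filter for unnormalized or badly sized sentences."""
    words = sentence.split()
    # Drop one-word fragments and run-ons dozens of words long.
    if not 3 <= len(words) <= 30:
        return False
    # Drop sentences with digits or symbols a TTS model may not read aloud.
    if re.search(r"[0-9#@_^~|<>{}\[\]\\/*=+]", sentence):
        return False
    return True

sentences = df["sentence"].dropna().astype(str)
clean = sentences[sentences.map(is_clean)]
test_cases = clean.sample(n=min(100, len(clean)), random_state=0).tolist()
print(len(test_cases), test_cases[:3])
```

Note that a whitespace word count like this only makes sense for languages that use spaces; for Japanese or Chinese we'd filter on character length instead.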

Well, at first I wanted to hardcode another language. The Japanese TTS Arena by @kotoba-tech, which is also a clone of TTS Arena, has been down for quite a few weeks:
https://huggingface.co/spaces/kotoba-tech/TTS-Arena-JA

Pinging @lihaoxin2020 @arumaekawa @kojimano3 @jungok to prod their interest in having a multilingual TTS Arena. This would also mean the top TTS models get even more scrutiny and are challenged harder. The Leaderboard would then have language filters. I wanted text-style filters too, but those will have to come later.
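To make the language-filter idea concrete, something like this tiny Gradio sketch; the column names and toy rows are placeholders for illustration, not the arena's actual schema:

```python
# Hypothetical sketch of a language filter on the leaderboard table.
# The results DataFrame below is toy data; the real arena would load
# its own stored ratings.
import gradio as gr
import pandas as pd

results = pd.DataFrame(
    {
        "model": ["XTTS", "MeloTTS", "StyleTTS 2"],
        "language": ["ja", "en", "en"],
        "score": [1105, 1080, 1120],
    }
)

def filter_by_language(lang: str) -> pd.DataFrame:
    # "All" shows every row; otherwise keep only the selected language.
    if lang == "All":
        return results
    return results[results["language"] == lang]

with gr.Blocks() as demo:
    lang = gr.Dropdown(
        choices=["All"] + sorted(results["language"].unique()),
        value="All",
        label="Language",
    )
    table = gr.Dataframe(value=results)
    lang.change(filter_by_language, inputs=lang, outputs=table)

demo.launch()
```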

Text is one thing; the other is getting native voices, both fine-tuned models and voice samples for zero-shot TTS. That is quite a bit of work to do alone... 😡

I don't think we need to get copies of fine-tuned models? I thought the arena is meant to benchmark the performance of the base model, not various downstream adapted models. If a model doesn't support a language, we can just say it fails in that dimension, which is still a useful piece of information and an indication of the model's capability.
