
Add a tab for comparing voice cloning between models

#14

by SilentAntagonist - opened Feb 25, 2024

Discussion

SilentAntagonist

Feb 25, 2024

Greetings,

Several of the models you are using support voice cloning. Please add a tab in the arena where we can upload audio samples and compare the voice cloning outputs between the models.

Thank you

Pendrokar

Feb 25, 2024

True, all the current TTS-Arena models support instant voice cloning.
https://github.com/Pendrokar/open-tts-tracker/blob/patch-3/README.md#capability-specifics

Though I don't see a need for human input to evaluate clones. 😕

SilentAntagonist

Feb 25, 2024

•

edited Feb 25, 2024

> True, all the current TTS-Arena models support instant voice cloning.
> https://github.com/Pendrokar/open-tts-tracker/blob/patch-3/README.md#capability-specifics
>
> Though I don't see a need for human input to evaluate clones. 😕

Besides voice pitch, a good model would also copy the speech patterns, accent, and mannerisms from the input audio sample. Evaluating these with AI is difficult at the moment, so human evaluation would be valuable.

sheng1105

Feb 26, 2024

What voice prompts are currently used?

Pendrokar

Feb 26, 2024

•

edited Feb 26, 2024

> Besides voice pitch, a good model would also copy the speech patterns, accent, and mannerisms from the input audio sample. Evaluating these with AI is difficult at the moment, so human evaluation would be valuable.

This process can be automated by having the voice clone synthesize the text of a sample from the original voice that is not part of the dataset. Do that multiple times, then compare the spectrograms to the original sample. The TTS with the fewest deviations wins. No human voting required.
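A rough sketch of that comparison in plain Python, in case it helps. It assumes each clip is already decoded into a list of float samples at the same sample rate and is roughly time-aligned with the reference. The function names (`magnitude_spectrogram`, `spectrogram_distance`, `rank_models`, `tone`) and the log-magnitude distance are illustrative assumptions, not anything TTS-Arena implements. The synthetic tones only stand in for real recordings.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (naive DFT, Hann window)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size) for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def spectrogram_distance(reference, candidate):
    """Mean absolute difference of log-magnitude spectrograms over the shared frames."""
    ref = magnitude_spectrogram(reference)
    cand = magnitude_spectrogram(candidate)
    n_frames = min(len(ref), len(cand))
    if n_frames == 0:
        raise ValueError("clips are shorter than one analysis frame")
    total, count = 0.0, 0
    for ref_frame, cand_frame in zip(ref[:n_frames], cand[:n_frames]):
        for a, b in zip(ref_frame, cand_frame):
            total += abs(math.log1p(a) - math.log1p(b))
            count += 1
    return total / count


def rank_models(reference, outputs_by_model):
    """Average the distance over each model's attempts; smallest deviation first."""
    scores = {
        name: sum(spectrogram_distance(reference, clip) for clip in clips) / len(clips)
        for name, clips in outputs_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


def tone(freq_hz, n_samples=512, sample_rate=8000):
    """Synthetic sine wave used here only as a stand-in for real audio."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n_samples)]


if __name__ == "__main__":
    reference = tone(440.0)
    outputs = {
        "model_a": [tone(445.0), tone(450.0)],  # hypothetical close clones
        "model_b": [tone(880.0), tone(900.0)],  # hypothetical poor clones
    }
    for name, score in rank_models(reference, outputs):
        print(f"{name}: {score:.4f}")
```

On real speech, a frame-by-frame comparison like this is sensitive to timing differences, so a practical version would probably need to align the clips first and might use mel spectrograms from an audio library instead of a raw DFT.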

> What voice prompts are currently used?

You mean the voice samples used for instant voice cloning? No clue.

SilentAntagonist

Mar 4, 2024

> > Besides voice pitch, a good model would also copy the speech patterns, accent, and mannerisms from the input audio sample. Evaluating these with AI is difficult at the moment, so human evaluation would be valuable.
>
> This process can be automated by having the voice clone synthesize the text of a sample from the original voice that is not part of the dataset. Do that multiple times, then compare the spectrograms to the original sample. The TTS with the fewest deviations wins. No human voting required.
>
> > What voice prompts are currently used?
>
> You mean the voice samples used for instant voice cloning? No clue.

People need to benchmark on samples not present in the dataset.

Pendrokar

Feb 2

@SilentAntagonist @sheng1105
This benchmark seems to do just that.
https://huggingface.co/spaces/ttsds/benchmark
