Add TTS Model: xVASynth

#15
by Pendrokar - opened

TTS name: xVASynth
Author: Dan Ruta (not me)
Model name: xVAPitch (the v3 model; v2 is FastPitch, IIRC)
Model link: https://huggingface.co/Pendrokar/xvapitch_nvidia (note the Legal note)
Model license: CC-BY
TTS license: GPL-3

🤗 Space: https://huggingface.co/spaces/Pendrokar/xVASynth

Several questions:

  • I hear you only synthesize female voices?
  • Not only that, but only American English voices?
  • Can it be either 22 kHz/24 kHz? [edit] ✔ ElevenLabs uses 44 kHz
  • Is post-synthesis super resolution to 44/48 kHz allowed? (not used by the Space)
  • Is RVC allowed post synthesis? (not used by the Space)
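
On the sample-rate question: plain resampling to a higher rate is not the same thing as super resolution. A minimal sketch (hypothetical helper, not part of xVASynth) of doubling 22.05 kHz audio to 44.1 kHz by linear interpolation shows why: it only produces a higher-rate container, adding no new high-frequency detail, which is exactly what a learned super-resolution step would be for.

```python
def upsample2x(samples):
    """Double the sample rate (e.g. 22.05 kHz -> 44.1 kHz) by linear
    interpolation. This adds no new spectral content above the original
    Nyquist frequency -- unlike learned audio super resolution."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)            # keep the original sample
        out.append((a + b) / 2)  # insert the midpoint between neighbors
    out.append(samples[-1])      # keep the final sample
    return out

print(upsample2x([0.0, 1.0, 0.0]))  # → [0.0, 0.5, 1.0, 0.5, 0.0]
```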

Sadly, the male voices in the xVASynth Space are better than most of the female voices. Except one, but that voice sounds British English. I will have to fetch the NVIDIA dataset to train a proper female American English voice. 🤔

The xVASynth Space in particular does not use or support CUDA. Loading a single model takes around 600 MB of RAM (2 GB of VRAM if regular xVASynth is run with CUDA). The 2-CPU-core Space hit a bottleneck once multiple people tried to use it at once, so CPU Upgrade had to be used on launch. The real-time factor is close to 1.0 or lower even on CPU.
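
For reference, real-time factor (RTF) is just synthesis time divided by the duration of the audio produced; below 1.0 means faster than real time. A trivial sketch (hypothetical helper, not from xVASynth):

```python
def real_time_factor(synth_seconds, audio_seconds):
    """RTF = wall-clock time spent synthesizing / duration of the audio.
    RTF < 1.0 means the system synthesizes faster than playback."""
    return synth_seconds / audio_seconds

# e.g. 3 s of compute producing a 4 s clip:
print(real_time_factor(3.0, 4.0))  # → 0.75
```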

Gradio API defaults:

```python
from gradio_client import Client

client = Client("Pendrokar/xVASynth")
result = client.predict(
        "Oh, hello.",  # text to synthesize
        "ccby_nvidia_hifi_92_F",  # voice model
        "en",  # language
        1.0,  # duration is 1.0 by default, not 0.5
        0,  # pitch unused
        0.1,  # energy unused
        0,
        0,
        0,
        0,
        True,  # DeepMoji affects inference
        api_name="/predict"
)
```

This is no longer an issue for the 🤗 Space of the TTS now that an HF CPU-Upgrade grant has been given; an RTF below 1.0 is now guaranteed. 😇👼

With the inclusion of VoiceCraft v2 in the TTS Arena, which @reach-vb admitted has a non-permissive license, I am dumbfounded by the silence on adding xVASynth.

To clarify, the quoted fine-tuned voice models themselves are made from datasets with a permissive license. It is the base model whose datasets include non-permissive licenses; the fine-tuning is done on the base models.

Now, I am not claiming that xVASynth can go toe to toe with StyleTTS 2 and XTTS, but the preliminary results for xVASynth on the cloned TTS Arena Space are not too shabby, even if I do force xVASynth to be one of the chosen candidates.
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena

So... what is the issue, if VoiceCraft's non-permissive license wasn't an issue for it? Why do newer, lesser-known TTS models take precedence in being included?
