Add TTS Model: xVASynth

#15
by Pendrokar - opened

TTS name: xVASynth
Author: Dan Ruta (not me)
Model name: xVAPitch (the v3 model; v2 is FastPitch, IIRC)
Model link: https://huggingface.co/Pendrokar/xvapitch_nvidia (note the Legal note)
Model license: CC-BY
TTS license: GPL-3

🤗 Space: https://huggingface.co/spaces/Pendrokar/xVASynth

Several questions:

  • I hear you only synthesize female voices?
  • Not only that, but only American English voices?
  • Can it be either 22 kHz/24 kHz? [edit] ✔ ElevenLabs uses 44 kHz
  • Is post-synthesis super resolution to 44/48 kHz allowed? (not used by the Space)
  • Is RVC allowed post synthesis? (not used by the Space)
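
On the sample-rate question: plain resampling to a higher rate is not the same thing as super resolution. A minimal sketch (hypothetical helper, not part of xVASynth) of doubling 22.05 kHz audio to 44.1 kHz by linear interpolation shows why: it only produces a higher-rate container, adding no new high-frequency detail, which is exactly what a learned super-resolution step would be for.

```python
def upsample2x(samples):
    """Double the sample rate (e.g. 22.05 kHz -> 44.1 kHz) by linear
    interpolation. This adds no new spectral content above the original
    Nyquist frequency -- unlike learned audio super resolution."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)            # keep the original sample
        out.append((a + b) / 2)  # insert the midpoint between neighbors
    out.append(samples[-1])      # keep the final sample
    return out

print(upsample2x([0.0, 1.0, 0.0]))  # → [0.0, 0.5, 1.0, 0.5, 0.0]
```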

Sadly, the male voices in the xVASynth Space are better than most of the female voices. Except one, but that voice sounds British English. I will have to fetch the NVIDIA dataset to train a proper female American English voice. 🤔

The xVASynth Space in particular does not use or support CUDA. Loading a single model takes around 600 MB of RAM (2 GB of VRAM if regular xVASynth is run with CUDA). The 2-CPU-core Space hit a bottleneck once multiple people tried to use it at once, so CPU Upgrade had to be used on launch. The real-time factor is close to 1.0 or lower even on CPU.
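
For reference, real-time factor (RTF) is just synthesis time divided by the duration of the audio produced; below 1.0 means faster than real time. A trivial sketch (hypothetical helper, not from xVASynth):

```python
def real_time_factor(synth_seconds, audio_seconds):
    """RTF = wall-clock time spent synthesizing / duration of the audio.
    RTF < 1.0 means the system synthesizes faster than playback."""
    return synth_seconds / audio_seconds

# e.g. 3 s of compute producing a 4 s clip:
print(real_time_factor(3.0, 4.0))  # → 0.75
```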

Gradio API defaults:

```python
from gradio_client import Client

client = Client("Pendrokar/xVASynth")
result = client.predict(
        "Oh, hello.",  # text to synthesize
        "ccby_nvidia_hifi_92_F",  # voice model
        "en",  # language
        1.0,  # duration is 1.0 by default, not 0.5
        0,  # pitch unused
        0.1,  # energy unused
        0,
        0,
        0,
        0,
        True,  # DeepMoji affects inference
        api_name="/predict"
)
```

This is no longer an issue for the 🤗 Space of the TTS now that an HF CPU-Upgrade grant has been given; an RTF below 1.0 is now guaranteed. 😇👼

With the inclusion of VoiceCraft v2 in the TTS Arena, which @reach-vb admitted has a non-permissive license, I am dumbfounded by the silence on adding xVASynth.

To clarify, the quoted fine-tuned voice models themselves are made from datasets with a permissive license. It is the base model whose datasets include non-permissive licenses; the fine-tuning is done on the base models.

Now, I am not claiming that xVASynth can go toe to toe with StyleTTS 2 and XTTS, but the preliminary results for xVASynth on the cloned TTS Arena Space are not too shabby, even if I do force xVASynth to be one of the chosen candidates.
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena

So... what is the issue, if VoiceCraft's non-permissive license wasn't an issue for it? Why do newer, lesser-known TTS models take precedence in being included?
