	# VITS

	VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech)
	is an end-to-end TTS model (the acoustic model and the vocoder are trained together) that combines GANs, a VAE, and
	normalizing flows. It does not require external alignment annotations; it learns the text-to-audio alignment
	with Monotonic Alignment Search (MAS), as explained in the paper. The architecture combines a GlowTTS encoder with a HiFiGAN vocoder.
	It is a feed-forward model that synthesizes speech about 67.12x faster than real time on a GPU.
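
To make the alignment step more concrete, here is a minimal NumPy sketch of the dynamic program behind Monotonic Alignment Search. It is an illustration and not the library's implementation: the function name `monotonic_alignment_search` and the toy `log_lik` matrix are assumptions made for this example. Given a matrix of per-token, per-frame log-likelihoods, it looks for the highest-scoring path that starts at the first token, ends at the last one, and never skips or revisits a token.

```python
import numpy as np


def monotonic_alignment_search(log_lik: np.ndarray) -> np.ndarray:
    """Return a 0/1 matrix marking the best monotonic token-to-frame path.

    log_lik[i, j] is the log-likelihood of audio frame j under text token i
    (hypothetical input for this sketch).
    """
    n_tokens, n_frames = log_lik.shape
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    # q[i, j]: best cumulative score of any valid path ending at (i, j).
    q = np.full((n_tokens, n_frames), -np.inf)
    q[0, 0] = log_lik[0, 0]
    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, hence the upper bound.
        for i in range(min(j + 1, n_tokens)):
            stay = q[i, j - 1]  # same token as the previous frame
            advance = q[i - 1, j - 1] if i > 0 else -np.inf  # next token
            q[i, j] = log_lik[i, j] + max(stay, advance)

    # Backtrack from the last token at the last frame.
    path = np.zeros((n_tokens, n_frames), dtype=np.int64)
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[i, j] = 1
        if i > 0 and (i == j or q[i - 1, j - 1] > q[i, j - 1]):
            i -= 1
    return path


# Toy example: 3 text tokens, 4 audio frames.
log_lik = np.array(
    [
        [0.0, -0.5, -5.0, -5.0],
        [-5.0, -2.0, 0.0, -5.0],
        [-5.0, -5.0, -5.0, 0.0],
    ]
)
path = monotonic_alignment_search(log_lik)
durations = path.sum(axis=1)  # number of frames assigned to each token
```

Summing the path along the frame axis gives a per-token frame count, which is one way to read token durations off the alignment.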

	🐸 YourTTS is a multi-speaker, multilingual TTS model that can perform voice conversion and zero-shot speaker adaptation.
	It can also learn a new language or voice from an audio clip roughly one minute long, which opens the door to training
	TTS models for low-resource languages. 🐸 YourTTS uses VITS as its backbone architecture, coupled with a speaker encoder model.
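
As a rough usage sketch of zero-shot speaker adaptation, the helper below wraps the 🐸TTS Python API around the released YourTTS checkpoint. The helper name `clone_voice`, its defaults, and the file paths are assumptions made for this example. Calling it requires the `TTS` package to be installed, and the model is downloaded on first use.

```python
def clone_voice(text, speaker_wav, out_path="output.wav", language="en"):
    """Synthesize `text` in the voice of the reference clip `speaker_wav`.

    Hypothetical convenience wrapper; the import is done lazily so that
    defining the helper does not require the TTS package.
    """
    from TTS.api import TTS

    # Released multilingual YourTTS model (VITS backbone + speaker encoder).
    tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,  # short reference recording of the target voice
        language=language,
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_voice("This is voice cloning.", "my/cloning/audio.wav")` would then write the synthesized speech to `output.wav`; the reference path here is a placeholder.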

	## Important resources & papers
	- 🐸 YourTTS: https://arxiv.org/abs/2112.02418
	- VITS: https://arxiv.org/pdf/2106.06103.pdf
	- Neural Spline Flows: https://arxiv.org/abs/1906.04032
	- Variational Autoencoder: https://arxiv.org/pdf/1312.6114.pdf
	- Generative Adversarial Networks: https://arxiv.org/abs/1406.2661
	- HiFiGAN: https://arxiv.org/abs/2010.05646
	- Normalizing Flows: https://blog.evjang.com/2018/01/nf1.html

	## VitsConfig
	```{eval-rst}
	.. autoclass:: TTS.tts.configs.vits_config.VitsConfig
	    :members:
	```

	## VitsArgs
	```{eval-rst}
	.. autoclass:: TTS.tts.models.vits.VitsArgs
	    :members:
	```

	## Vits Model
	```{eval-rst}
	.. autoclass:: TTS.tts.models.vits.Vits
	    :members:
	```