---
license: mit
tags:
- vits
pipeline_tag: text-to-speech
---

# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

VITS is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a
conditional variational autoencoder (VAE) composed of a posterior encoder, decoder, and conditional prior. This repository
contains the weights for the official VITS checkpoint trained on the [VCTK](https://huggingface.co/datasets/vctk) dataset.

## Model Details

VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end
speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational
autoencoder (VAE) composed of a posterior encoder, decoder, and conditional prior.

A set of spectrogram-based acoustic features is predicted by the flow-based module, which is formed of a Transformer-based
text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers,
much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text
input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows it to
synthesise speech with different rhythms from the same input text.

The model is trained end-to-end with a combination of losses derived from the variational lower bound and adversarial training.
To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During
inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the
waveform using a cascade of the flow module and the HiFi-GAN decoder. Due to the stochastic nature of the duration predictor,
the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform.

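
The sketch below illustrates this: it fixes the global PyTorch seed (here via `torch.manual_seed`; `transformers.set_seed` would work equally well) before each forward pass, so repeated runs should yield identical waveforms. It loads the same checkpoint as the Usage section further down.

```python
import torch
from transformers import VitsModel, AutoTokenizer

# Load the multi-speaker VCTK checkpoint (see the Usage section below)
model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")
tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-vctk")

inputs = tokenizer("The same text, spoken the same way every time.", return_tensors="pt")

def synthesise(seed: int) -> torch.Tensor:
    # Fixing the seed pins down the noise drawn by the stochastic duration
    # predictor and the sampling from the prior, making generation repeatable
    torch.manual_seed(seed)
    with torch.no_grad():
        return model(**inputs).waveform

# Same seed, same waveform; a different seed gives a different rhythm
print(torch.allclose(synthesise(seed=42), synthesise(seed=42)))  # True
```
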
There are two variants of the VITS model: one is trained on the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset,
and the other is trained on the [VCTK](https://huggingface.co/datasets/vctk) dataset. The LJ Speech dataset consists of 13,100 short
audio clips of a single speaker, with a total length of approximately 24 hours. The VCTK dataset consists of approximately 44,000
short audio clips uttered by 109 native English speakers with various accents. The total length of these clips is approximately
44 hours.

| Checkpoint | Train Hours | Speakers |
|------------|-------------|----------|
| [vits-ljs](https://huggingface.co/kakao-enterprise/vits-ljs) | 24 | 1 |
| [vits-vctk](https://huggingface.co/kakao-enterprise/vits-vctk) | 44 | 109 |

## Usage

VITS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint,
first install the latest version of the library:

```
pip install --upgrade transformers accelerate
```

Then, run inference with the following code snippet:

```python
from transformers import VitsModel, AutoTokenizer
import torch

model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")
tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-vctk")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs).waveform
```
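
Since the VCTK checkpoint is multi-speaker (109 voices), a particular voice can be selected for synthesis. The snippet below, continuing from the code above, is a small sketch of this; it assumes the `speaker_id` argument exposed by `VitsModel.forward` for multi-speaker checkpoints in recent Transformers releases, with valid ids running from 0 to `model.config.num_speakers - 1`:

```python
# Choose one of the VCTK voices (assumes speaker_id is supported by this
# VitsModel version; valid ids are 0 .. model.config.num_speakers - 1)
speaker_id = 10

with torch.no_grad():
    output = model(**inputs, speaker_id=speaker_id).waveform
```
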

The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile

# The output is a (batch_size, num_samples) torch tensor; take the first (and
# only) waveform and convert it to a numpy array before writing it to disk
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=output[0].cpu().numpy())
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(output[0].cpu().numpy(), rate=model.config.sampling_rate)
```

## BibTeX citation

This model was developed by Jaehyeon Kim et al. from Kakao Enterprise. If you use the model, consider citing the VITS paper:

```
@inproceedings{kim2021conditional,
  title={Conditional Variational Autoencoder with Adversarial Learning for End-to-end Text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  pages={5530--5540},
  year={2021},
  organization={PMLR}
}
```

## License

The model is licensed under the [**MIT**](https://github.com/jaywalnut310/vits/blob/main/LICENSE) license.