aoxo committed
Commit 495c0df
Parent: 0d48494

Update README.md

---
license: cc-by-nc-4.0
tags:
- tts
- gpt2
- vae
pipeline_tag: text-to-speech
---

# Malayalam Text-to-Speech

This repository contains the **Malayalam (mal)** language text-to-speech (TTS) model checkpoint.

## Model Details

Sura (**S**tochastic **U**nified **R**epresentation for **A**dversarial learning) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.

The model's text encoder is built on a Wav2Vec2 decoder, while the decoder is a VAE with 124M parameters. The flow-based module, composed of the GPT-2-based encoder and cascaded dense layers, predicts spectrogram-based acoustic features. The spectrogram is then transformed into a speech waveform by a stack of transposed convolutional layers. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, allowing varied speech rhythms from the same text input.
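
As a mental model of that last stage, the sketch below shows how a stack of transposed 1-D convolutions can upsample spectrogram frames into waveform samples. This is illustrative only, not the actual Sura decoder: the class name, channel widths, and upsampling rates are all hypothetical.

```python
# Illustrative only -- NOT the actual Sura decoder; all sizes are hypothetical.
import torch
import torch.nn as nn

class SpecToWaveform(nn.Module):
    """Upsamples (batch, n_mels, frames) spectrograms into (batch, 1, samples) audio."""
    def __init__(self, n_mels=80, channels=256, upsample_rates=(8, 8, 2, 2)):
        super().__init__()
        layers = [nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)]
        ch = channels
        for r in upsample_rates:
            # each transposed conv multiplies the time resolution by r
            layers += [
                nn.LeakyReLU(0.1),
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r, stride=r, padding=r // 2),
            ]
            ch //= 2
        layers += [nn.LeakyReLU(0.1), nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, spec):
        return self.net(spec)

# 100 spectrogram frames -> 100 * 8*8*2*2 = 25,600 audio samples
wave = SpecToWaveform()(torch.randn(1, 80, 100))
print(wave.shape)  # torch.Size([1, 1, 25600])
```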

Sura is trained end-to-end using a combination of losses from the variational lower bound and adversarial training techniques. During inference, the text encodings are up-sampled based on the predicted durations, and subsequently mapped into the waveform via the flow module and the VAE decoder. Due to the stochastic nature of the duration predictor, the model is non-deterministic and requires a fixed seed to produce identical speech outputs.
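
For reproducible output, the seed can be pinned before generation; a minimal sketch using the `set_seed` utility shipped with `transformers` (the seed value itself is arbitrary):

```python
from transformers import set_seed

set_seed(555)  # seeds Python, NumPy and PyTorch RNGs so repeated runs generate identical audio
```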

## Usage

```
pip install --upgrade transformers accelerate
```

Then, run inference with the following code snippet:

```python
from transformers import VitsModel, AutoTokenizer
import torch

model = VitsModel.from_pretrained("aoxo/gpt2-vae-tts-mal")
tokenizer = AutoTokenizer.from_pretrained("aoxo/gpt2-vae-tts-mal")

text = "കള്ളാ കടയാടി മോനെ"  # Malayalam input text
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs).waveform
```
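
The returned `output` is a batched float tensor; it can be useful to check its shape and the model's sampling rate (names as in the snippet above) before saving:

```python
print(output.shape)                # torch.Size([1, num_samples]) -- a batch of one waveform
print(model.config.sampling_rate)  # playback rate in Hz for the generated audio
```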

The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile

# squeeze the batch dimension and convert to NumPy before writing
scipy.io.wavfile.write(
    "kadayadi_mone.wav",
    rate=model.config.sampling_rate,
    data=output.squeeze().numpy(),
)
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
```

## License

The model is licensed as **CC-BY-NC 4.0**.