Update README.md
README.md
CHANGED
@@ -10,11 +10,11 @@ pipeline_tag: text-to-speech
# Malayalam Text-to-Speech

- This repository contains the **
+ This repository contains the **Swaram (mal)** text-to-speech (TTS) model checkpoint.

## Model Details

- Swaram (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.
+ **Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.

The model's text encoder is built on Wav2Vec2 decoder, while the decoder is a VAE. A flow-based module predicts spectrogram-based acoustic features, which is composed of the Transformer-based Contextualizer and cascaded dense layers. The spectrogram is then transformed into a speech waveform using a stack of transposed convolutional layers. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, allowing for varied speech rhythms from the same text input.
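
The stochastic duration predictor described in the README can be illustrated with a toy sketch. This is not Swaram's implementation: the function names, the per-token mean log-durations, the Gaussian noise model, and the upsampling factor of 256 are all assumptions chosen for illustration. It only shows how adding noise to predicted durations yields different rhythms (and waveform lengths) for the same text, while a noise scale of zero is deterministic.

```python
import math
import random


def sample_durations(mean_log_durations, noise_scale, rng):
    """Sample one spectrogram-frame count per input token (toy model).

    Each mean log-duration is perturbed with Gaussian noise, so repeated
    calls on the same text can give different rhythms. With noise_scale=0
    the result is deterministic.
    """
    durations = []
    for mean in mean_log_durations:
        log_d = mean + noise_scale * rng.gauss(0.0, 1.0)
        # Every token gets at least one frame.
        durations.append(max(1, math.ceil(math.exp(log_d))))
    return durations


def waveform_length(durations, upsample_factor=256):
    """Number of audio samples if each frame is upsampled by a fixed factor.

    The factor 256 is an assumed value, not taken from the README.
    """
    return sum(durations) * upsample_factor


# Hypothetical mean log-durations for a five-token input.
means = [1.2, 0.8, 1.5, 0.6, 1.0]

fixed = sample_durations(means, noise_scale=0.0, rng=random.Random(0))
varied_a = sample_durations(means, noise_scale=0.8, rng=random.Random(1))
varied_b = sample_durations(means, noise_scale=0.8, rng=random.Random(2))

print("deterministic:", fixed, waveform_length(fixed))
print("sample A:     ", varied_a, waveform_length(varied_a))
print("sample B:     ", varied_b, waveform_length(varied_b))
```

Seeding each `random.Random` instance separately keeps the sketch reproducible; in a real model the noise would come from the framework's own random number generator.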