aoxo committed on
Commit
e3f11fa
1 Parent(s): 48e9975

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -10,11 +10,11 @@ pipeline_tag: text-to-speech
 
  # Malayalam Text-to-Speech
 
- This repository contains the **Malayalam (mal)** language text-to-speech (TTS) model checkpoint.
+ This repository contains the **Swaram (mal)** text-to-speech (TTS) model checkpoint.
 
  ## Model Details
 
- Swaram (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.
+ **Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.
 
  The model's text encoder is built on Wav2Vec2 decoder, while the decoder is a VAE. A flow-based module predicts spectrogram-based acoustic features, which is composed of the Transformer-based Contextualizer and cascaded dense layers. The spectrogram is then transformed into a speech waveform using a stack of transposed convolutional layers. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, allowing for varied speech rhythms from the same text input.
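
Reviewer note on the unchanged "Model Details" paragraph: it says the spectrogram is turned into a waveform by a stack of transposed convolutional layers. The sketch below shows how such a stack maps spectrogram frames to waveform samples, using the standard output-length formula for a 1-D transposed convolution. The layer sizes in `ASSUMED_LAYERS` and the padding rule are assumptions for illustration only; the README does not state Swaram's actual configuration.

```python
def conv_transpose1d_out_len(length, kernel, stride, padding=0,
                             output_padding=0, dilation=1):
    """Output length of one 1-D transposed convolution (standard formula)."""
    return ((length - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


def upsampled_length(frames, layers):
    """Pass a frame count through a stack of (kernel, stride) layers.

    Padding is set to (kernel - stride) // 2, an assumed convention that
    makes each layer upsample by exactly `stride` when kernel - stride
    is even.
    """
    length = frames
    for kernel, stride in layers:
        padding = (kernel - stride) // 2
        length = conv_transpose1d_out_len(length, kernel, stride, padding)
    return length


# Hypothetical (kernel, stride) pairs; NOT taken from the Swaram checkpoint.
ASSUMED_LAYERS = [(16, 8), (16, 8), (4, 2), (4, 2)]

if __name__ == "__main__":
    frames = 100
    samples = upsampled_length(frames, ASSUMED_LAYERS)
    # Total upsampling factor is the product of the strides: 8 * 8 * 2 * 2 = 256.
    print(frames, "frames ->", samples, "samples")
```

With these assumed sizes, each spectrogram frame expands to 256 waveform samples, so the hop length of the acoustic features would need to match the product of the strides.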