aoxo committed on
Commit
e43e6fc
1 Parent(s): e3f11fa

Update README.md

Files changed (1)
  1. README.md +2 -4
README.md CHANGED
@@ -14,11 +14,9 @@ This repository contains the **Swaram (mal)** text-to-speech (TTS) model checkpo
 
  ## Model Details
 
- **Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.
+ **Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a **conditional variational autoencoder** (VAE) architecture.
 
- The model's text encoder is built on Wav2Vec2 decoder, while the decoder is a VAE. A flow-based module predicts spectrogram-based acoustic features, which is composed of the Transformer-based Contextualizer and cascaded dense layers. The spectrogram is then transformed into a speech waveform using a stack of transposed convolutional layers. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, allowing for varied speech rhythms from the same text input.
-
- Sura is trained end-to-end using a combination of losses from the variational lower bound and adversarial training techniques. During inference, the text encodings are up-sampled based on the predicted durations, and subsequently mapped into the waveform via the flow module and the VAE decoder. Due to the stochastic nature of the duration predictor, the model is non-deterministic and requires a fixed seed to produce identical speech outputs.
+ Swaram's text encoder is built on top of the **Wav2Vec2 decoder**. A **VAE** is used as the decoder. A **flow-based module**, composed of a **Transformer-based Contextualizer** and cascaded dense layers, predicts **spectrogram-based acoustic features**. The spectrogram is then transformed into a speech waveform using a stack of **transposed convolutional layers**. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, allowing for varied speech rhythms from the same text input.
 
  ## Usage
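
To make the updated architecture paragraph concrete, here is a minimal, untrained PyTorch sketch of the inference path it describes: a stochastic duration predictor expands the text encodings, a Transformer-based contextualizer with cascaded dense layers predicts spectrogram frames, and transposed convolutions upsample those frames into a waveform. Every class name, layer size, and the 80-bin spectrogram below are illustrative assumptions for this sketch, not Swaram's actual modules or checkpoint API.

```python
import torch
import torch.nn as nn


class StochasticDurationPredictor(nn.Module):
    """Predicts a per-token duration from noisy text encodings, so the
    same text can be spoken with a different rhythm on each call."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, text_enc: torch.Tensor) -> torch.Tensor:
        # Inject Gaussian noise so the predicted durations are stochastic.
        log_dur = self.proj(text_enc + 0.1 * torch.randn_like(text_enc))
        return log_dur.exp().squeeze(-1).round().long().clamp(min=1)


class SpectrogramPredictor(nn.Module):
    """Transformer-based contextualizer followed by cascaded dense layers,
    mapping upsampled text encodings to spectrogram-like acoustic features."""

    def __init__(self, dim: int, n_mels: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.contextualizer = nn.TransformerEncoder(layer, num_layers=2)
        self.dense = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_mels)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dense(self.contextualizer(x))


class WaveformDecoder(nn.Module):
    """Stack of transposed convolutions that upsamples spectrogram frames
    into a raw waveform (64x upsampling here, purely for illustration)."""

    def __init__(self, n_mels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 64, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=16, stride=8, padding=4),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, n_mels) -> waveform: (batch, 1, samples)
        return self.net(spec.transpose(1, 2))


def synthesize(text_enc: torch.Tensor) -> torch.Tensor:
    """text_enc: (1, n_tokens, dim) encodings from a text encoder.
    The modules are freshly initialized here; a real model would load
    trained weights for all of them."""
    dim, n_mels = text_enc.size(-1), 80
    durations = StochasticDurationPredictor(dim)(text_enc)      # (1, n_tokens)
    # Upsample each text encoding according to its predicted duration.
    upsampled = torch.repeat_interleave(text_enc, durations[0], dim=1)
    spec = SpectrogramPredictor(dim, n_mels)(upsampled)         # (1, frames, 80)
    return WaveformDecoder(n_mels)(spec)                        # (1, 1, samples)


if __name__ == "__main__":
    torch.manual_seed(0)  # duration prediction is stochastic; seed for repeatability
    waveform = synthesize(torch.randn(1, 12, 64))
    print(waveform.shape)
```

Because the duration predictor injects noise before predicting durations, repeated calls on the same encodings yield different rhythms and waveform lengths; fixing the seed, as above, makes the output repeatable.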