---
license: cc-by-nc-4.0
tags:
- tts
- gpt2
- vae
pipeline_tag: text-to-speech
---

# Malayalam Text-to-Speech

This repository contains the **Swaram (mal)** text-to-speech (TTS) model checkpoint.

## Model Details

**Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is a speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a **conditional variational autoencoder** (VAE) architecture.

Swaram's text encoder is built on top of the **Wav2Vec2 decoder**, and a **VAE** serves as the decoder. A **flow-based module**, composed of the **Transformer-based Contextualizer** and cascaded dense layers, predicts **spectrogram-based acoustic features**. A stack of **transposed convolutional layers** then transforms the spectrogram into a speech waveform. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a **stochastic duration predictor**, which lets it produce different speech rhythms from the same text input.
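
The one-to-many behaviour described above can be illustrated with a toy sketch. This is not Swaram's duration predictor: the names `sample_durations`, `expand_tokens`, `mean_frames`, and `noise_scale` are hypothetical, and the log-normal noise is an assumption chosen only to keep durations positive. The sketch shows how random per-token durations turn the same token sequence into frame sequences of different lengths, and how fixing the seed makes a take repeatable.

```python
import math
import random


def sample_durations(num_tokens, mean_frames=5.0, noise_scale=0.8, seed=None):
    """Draw one positive frame count per token (toy stand-in for a duration predictor)."""
    rng = random.Random(seed)
    durations = []
    for _ in range(num_tokens):
        # Noise is added in the log domain so the duration stays positive;
        # noise_scale=0.0 removes the randomness entirely.
        log_duration = math.log(mean_frames) + noise_scale * rng.gauss(0.0, 1.0)
        durations.append(max(1, round(math.exp(log_duration))))
    return durations


def expand_tokens(tokens, durations):
    """Repeat each token for its number of frames (length regulation)."""
    frames = []
    for token, duration in zip(tokens, durations):
        frames.extend([token] * duration)
    return frames


tokens = ["k", "a", "d", "a"]  # hypothetical token sequence
for seed in (1, 2):
    durations = sample_durations(len(tokens), seed=seed)
    frames = expand_tokens(tokens, durations)
    print(f"seed={seed} durations={durations} total_frames={len(frames)}")
```

Setting `noise_scale=0.0` in this sketch gives every token the same duration, which corresponds to a fully deterministic rhythm.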

## Usage

First, install the latest versions of the required libraries:

```
pip install --upgrade transformers accelerate
```

Then, run inference with the following code snippet:

```python
from transformers import VitsModel, AutoTokenizer
import torch

# Load the checkpoint and its tokenizer from the Hugging Face Hub
model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")

text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")

# Disable gradient tracking for inference; the waveform has shape (batch_size, num_samples)
with torch.no_grad():
    output = model(**inputs).waveform
```

The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile

# Drop the batch dimension and convert the tensor to a NumPy array before writing
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output[0].cpu().numpy())
```
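
`scipy.io.wavfile.write` stores floating-point input as a floating-point WAV file, which some players and tools do not accept. As a hedged alternative, the sketch below writes 16-bit PCM using only the Python standard library. The helper name `write_pcm16_wav` and the demo file name are hypothetical, and the code assumes a mono signal with samples in the range [-1.0, 1.0]; values outside that range are clipped.

```python
import array
import math
import os
import sys
import tempfile
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())


# Demo with a synthetic 440 Hz tone (0.1 s at 16 kHz) instead of model output
sampling_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(1600)]
demo_path = os.path.join(tempfile.gettempdir(), "swaram_pcm16_demo.wav")
write_pcm16_wav(demo_path, tone, sampling_rate)
```

With the inference snippet above, the same helper could be called as `write_pcm16_wav("kadayadi_mone.wav", output[0].tolist(), model.config.sampling_rate)`.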

Or played back in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

# Pass the waveform as a NumPy array without the batch dimension
Audio(output[0].cpu().numpy(), rate=model.config.sampling_rate)
```

## License

The model is licensed under **CC-BY-NC 4.0**.