---
license: cc-by-nc-4.0
tags:
- tts
- gpt2
- vae
pipeline_tag: text-to-speech
---
# Malayalam Text-to-Speech
This repository contains the **Swaram (mal)** text-to-speech (TTS) model checkpoint.
## Model Details
**Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is a speech synthesis model that generates a speech waveform conditioned on an input text sequence. It is based on a **conditional variational autoencoder** (VAE) architecture.
Swaram's text encoder is built on top of the **Wav2Vec2 decoder**. A **VAE** is used as the decoder. A **flow-based module**, composed of the **Transformer-based Contextualizer** and cascaded dense layers, predicts **spectrogram-based acoustic features**. The spectrogram is then transformed into a speech waveform by a stack of **transposed convolutional layers**. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, which lets it synthesize speech with different rhythms from the same input text.
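The effect of a stochastic duration predictor can be illustrated with a toy sketch. This is not Swaram's actual module: the function `sample_durations`, its parameters, and the per-token values below are hypothetical and only show how adding noise to predicted log-durations yields a different rhythm on each run, while a fixed seed reproduces the same one.

```python
import math
import random


def sample_durations(mean_log_durations, noise_scale=0.8, speaking_rate=1.0, seed=None):
    """Toy stand-in for a stochastic duration predictor (hypothetical, for illustration).

    Each token gets a frame count drawn around its mean log-duration.
    """
    rng = random.Random(seed)            # local generator, so runs are reproducible per seed
    length_scale = 1.0 / speaking_rate   # a faster speaking rate means shorter durations
    durations = []
    for mean in mean_log_durations:
        log_d = mean + noise_scale * rng.gauss(0.0, 1.0)  # perturb in the log domain
        frames = math.ceil(math.exp(log_d) * length_scale)
        durations.append(max(1, frames))  # every token lasts at least one frame
    return durations


# Hypothetical mean log-durations for four input tokens
mean_log_durations = [1.2, 0.8, 1.5, 1.0]

take_a = sample_durations(mean_log_durations, seed=1)
take_b = sample_durations(mean_log_durations, seed=2)        # usually a different rhythm
take_a_again = sample_durations(mean_log_durations, seed=1)  # same seed, same rhythm
flat = sample_durations(mean_log_durations, noise_scale=0.0)  # no noise, deterministic

print(take_a, take_b, take_a_again, flat)
```

With `noise_scale=0.0` the sketch collapses to a single deterministic rhythm, which is the one-to-one behaviour the stochastic predictor is meant to avoid.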
## Usage
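First, install the latest versions of the required libraries (`scipy` is only needed to save the output as a `.wav` file):
```
pip install --upgrade transformers accelerate scipy
```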
Then, run inference with the following code snippet:
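```python
from transformers import VitsModel, AutoTokenizer
import torch

# Load the checkpoint and its tokenizer from the Hugging Face Hub
model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")

text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so disable gradient tracking
with torch.no_grad():
    output = model(**inputs).waveform  # tensor of shape (batch_size, num_samples)
```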
The resulting waveform can be saved as a `.wav` file:
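```python
import scipy.io.wavfile

# scipy expects a NumPy array; squeeze the batch dimension to get a 1-D mono signal
waveform = output.squeeze().cpu().numpy()
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=waveform)
```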
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio

# Pass the waveform as a 1-D NumPy array
Audio(output.squeeze().cpu().numpy(), rate=model.config.sampling_rate)
```
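If SciPy is not available, the saving step can also be sketched with Python's standard library alone. The helper below is a minimal, hypothetical example (`write_pcm16_wav` is not part of this repository or of Transformers). It assumes the samples are floats in the range [-1.0, 1.0] and writes them as 16-bit mono PCM. The 440 Hz tone and the 16000 Hz rate are placeholders so that the snippet runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Placeholder signal: one second of a 440 Hz tone at an assumed 16 kHz rate
demo_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate)]
demo_path = os.path.join(tempfile.gettempdir(), "tone_pcm16.wav")
write_pcm16_wav(demo_path, tone, demo_rate)
```

With the model output from the inference snippet, the call would look like `write_pcm16_wav("kadayadi_mone.wav", output.squeeze().tolist(), model.config.sampling_rate)`.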
## License
The model is licensed under **CC-BY-NC 4.0**.