---
license: cc-by-nc-4.0
tags:
- tts
- gpt2
- vae
pipeline_tag: text-to-speech
---
# Malayalam Text-to-Speech
This repository contains the **Swaram (mal)** text-to-speech (TTS) model checkpoint.
## Model Details
**Swaram** (**S**tochastic **W**aveform **A**daptive **R**ecurrent **A**utoencoder for **M**alayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a **conditional variational autoencoder** (VAE) architecture.
Swaram's text encoder is built on top of the **Wav2Vec2 decoder**, and a **VAE** serves as the decoder. A **flow-based module**, composed of a **Transformer-based Contextualizer** and cascaded dense layers, predicts **spectrogram-based acoustic features**. The spectrogram is then transformed into a speech waveform by a stack of **transposed convolutional layers**. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a **stochastic duration predictor**, allowing varied speech rhythms to be generated from the same text input (see the sampling controls in the Usage section below).
## Architecture
![architecture](architecture.png)
## Usage
```bash
pip install --upgrade transformers accelerate
```
Then, run inference with the following code snippet:
```python
from transformers import VitsModel, AutoTokenizer
import torch

model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")

text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")

# No gradients are needed at inference time
with torch.no_grad():
    output = model(**inputs).waveform
```
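Because the duration predictor and the VAE prior are sampled at inference time, repeated calls on the same text yield slightly different prosody. The amount of variation and the speaking rate can be tuned before generation; this is a minimal sketch assuming the checkpoint exposes the standard `VitsModel` sampling attributes (`speaking_rate`, `noise_scale`, `noise_scale_duration`):
```python
# Illustrative values -- attributes assumed from transformers' VitsModel
model.speaking_rate = 1.2          # >1.0 speaks faster, <1.0 slower
model.noise_scale = 0.6            # scales the noise fed to the flow / VAE prior
model.noise_scale_duration = 0.8   # scales the stochastic duration predictor's noise

with torch.no_grad():
    varied_output = model(**inputs).waveform
```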
The resulting waveform can be saved as a `.wav` file:
```python
import scipy.io.wavfile

# scipy expects a NumPy array; drop the batch dimension from the (1, num_samples) tensor
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())
```
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio

Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
```
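Several sentences can also be synthesized in one batch. This is a sketch assuming the tokenizer and model handle padded batches the way standard `transformers` VITS checkpoints do; the second sentence is purely illustrative:
```python
# Illustrative batch of inputs; pad so the sequences share one tensor shape
texts = ["കള്ളാ കടയാടി മോനെ", "നമസ്കാരം"]
batch = tokenizer(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    waveforms = model(**batch).waveform  # shape: (batch_size, num_samples)
```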
## License
The model is licensed under **CC BY-NC 4.0**.