	---
	license: cc-by-nc-4.0
	tags:
	- tts
	- gpt2
	- vae
	pipeline_tag: text-to-speech
	---

	# Malayalam Text-to-Speech

	This repository contains the Swaram (mal) text-to-speech (TTS) model checkpoint.

	## Model Details

	Swaram (Stochastic Waveform Adaptive Recurrent Autoencoder for Malayalam) is a speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.

	Swaram's text encoder is built on top of the Wav2Vec2 decoder, and a VAE serves as the decoder. A flow-based module, composed of the Transformer-based Contextualizer and cascaded dense layers, predicts spectrogram-based acoustic features. A stack of transposed convolutional layers then transforms the spectrogram into a speech waveform. To capture the one-to-many nature of TTS, where the same text can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows varied speech rhythms from the same text input.
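
The one-to-many behaviour described above can be illustrated with a toy sketch. This is not Swaram's duration predictor: it is a hypothetical stand-in that samples one duration per token from a log-normal distribution, to show why two runs over the same text can produce different rhythms and why fixing the random seed makes a run repeatable. The function name `sample_durations` and its parameters (`typical_frames`, `noise_scale`, `seed`) are assumptions made for illustration only.

```python
import math
import random


def sample_durations(num_tokens, typical_frames=5.0, noise_scale=0.8, seed=None):
    """Toy stand-in for a stochastic duration predictor (hypothetical, not Swaram's code).

    Draws one duration (in frames) per token from a log-normal distribution
    whose median is `typical_frames`. A larger `noise_scale` spreads the
    durations further apart; `noise_scale=0.0` removes the randomness.
    """
    rng = random.Random(seed)  # private generator, so the seed affects only this call
    durations = []
    for _ in range(num_tokens):
        log_duration = math.log(typical_frames) + noise_scale * rng.gauss(0.0, 1.0)
        # Round to whole frames, with a floor of one frame per token.
        durations.append(max(1, round(math.exp(log_duration))))
    return durations


if __name__ == "__main__":
    print(sample_durations(6, seed=0))  # same seed, same rhythm on every run
    print(sample_durations(6, seed=1))  # another seed usually gives another rhythm
    print(sample_durations(6, noise_scale=0.0))  # no noise: all tokens share one duration
```

With the real checkpoint, calling `transformers.set_seed` before inference should play the role that `seed` plays in this sketch.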

	## Usage

	First, install the required libraries:

	```
	pip install --upgrade transformers accelerate
	```

	Then run inference with the following code snippet:

	```python
	from transformers import VitsModel, AutoTokenizer
	import torch

	model = VitsModel.from_pretrained("aoxo/swaram")
	tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")

	text = "കള്ളാ കടയാടി മോനെ"
	inputs = tokenizer(text, return_tensors="pt")

	with torch.no_grad():
	    output = model(**inputs).waveform
	```

	The resulting waveform can be saved as a `.wav` file:

	```python
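	import scipy.io.wavfile

	# The model returns a (batch, samples) tensor; scipy expects a 1-D NumPy array for mono audio
	scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output.squeeze().float().numpy())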
	```
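
Some tools expect 16-bit PCM rather than floating-point WAV data. As a minimal sketch, assuming the waveform values lie roughly in [-1.0, 1.0], the helper below writes a mono 16-bit file using only the Python standard library. The name `write_pcm16_wav` is hypothetical and not part of this repository or of `transformers`.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples as a 16-bit PCM WAV file (hypothetical helper).

    `samples` is any flat iterable of floats, assumed to lie in [-1.0, 1.0].
    Returns the clip duration in seconds.
    """
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # clip any overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(pcm))
    return (len(pcm) // 2) / sampling_rate
```

For the waveform tensor produced above, a call could look like `write_pcm16_wav("kadayadi_mone.wav", output.squeeze().tolist(), model.config.sampling_rate)`.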

	Or displayed in a Jupyter Notebook / Google Colab:

	```python
	from IPython.display import Audio

	Audio(output.numpy(), rate=model.config.sampling_rate)
	```

	## License

	The model is licensed under CC-BY-NC 4.0.