---
license: cc-by-4.0
---

# Fastspeech2 Model Using Hybrid Segmentation (HS)

This repository contains Fastspeech2 models for 16 Indian languages (both male and female voices), built using Hybrid Segmentation (HS) for speech synthesis. The model generates mel-spectrograms from text input, which can then be used to synthesize speech.

## Model Files

The model for each language includes the following files:

- `config.yaml`: Configuration file for the Fastspeech2 model.
- `energy_stats.npz`: Energy statistics for normalization during synthesis.
- `feats_stats.npz`: Feature statistics for normalization during synthesis.
- `feats_type`: Feature type information.
- `pitch_stats.npz`: Pitch statistics for normalization during synthesis.
- `model.pth`: Pre-trained Fastspeech2 model weights.

## Installation

1. Install [Miniconda](https://docs.conda.io/projects/miniconda/en/latest/) first, then create a conda environment from the provided `environment.yml` file:

   ```shell
   conda env create -f environment.yml
   ```

2. Activate the conda environment (the environment name is defined inside `environment.yml`):

   ```shell
   conda activate tts-hs-hifigan
   ```

3. Install PyTorch separately (pick the specific version that matches your requirements):

   ```shell
   conda install pytorch cudatoolkit
   pip install torchaudio
   pip install numpy==1.23.0
   ```

## Vocoder

To generate WAV files from mel-spectrograms, you can use a vocoder of your choice. One popular option is the [HiFi-GAN](https://github.com/jik876/hifi-gan) vocoder; clone that repository into the current working directory. Refer to the documentation of the vocoder you choose for installation and usage instructions. (**We used the HiFi-GAN vocoder and provide vocoders tuned on Aryan and Dravidian languages.** A minimal mel-to-WAV sketch is given in the appendix at the end of this README.)

## Usage

Directory paths are relative. If needed, update the folder/file paths in `text_preprocess_for_inference.py` and `inference.py`.

**Give the language and gender in lowercase, and place the sample text between quotes. Adjust the output speed with the `alpha` parameter (higher values produce slower speech and vice versa). The `--output_file` argument is optional; the provided name is used for the output file.**

Use the inference script to synthesize speech from text input:

```shell
python inference.py --sample_text "Your input text here" --language <language> --gender <gender> --alpha <alpha> --output_file <filename.wav>
```

**Example:**

```shell
python inference.py --sample_text "श्रीलंका और पाकिस्तान में खेला जा रहा एशिया कप अब तक का सबसे विवादित टूर्नामेंट होता जा रहा है।" --language hindi --gender male --alpha 1 --output_file male_hindi_output.wav
```

The file will be stored as `male_hindi_output.wav` in the current working directory. If the **--output_file** argument is not given, the output is stored as `<gender>_<language>_output.wav` in the current working directory.

### Citation

If you use this Fastspeech2 model in your research or work, please consider citing:

"COPYRIGHT 2023, Speech Technology Consortium, Bhashini, MeitY and by Hema A Murthy & S Umesh, DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING and ELECTRICAL ENGINEERING, IIT MADRAS. ALL RIGHTS RESERVED"

Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a [Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
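
## Appendix: Mel-to-WAV Sketch with HiFi-GAN

The repository's own `inference.py` already wires the vocoder in, so the snippet below is only an illustrative sketch of how a cloned [jik876/hifi-gan](https://github.com/jik876/hifi-gan) generator turns a mel-spectrogram into a WAV file. The checkpoint and config paths are placeholders for whichever tuned vocoder you downloaded, and the `mel` argument is assumed to be the `[n_mels, T]` tensor produced by the Fastspeech2 model.

```python
import json

import torch
from scipy.io.wavfile import write

# These two imports come from the cloned jik876/hifi-gan repository,
# assumed to be on the Python path (e.g. run from inside its directory).
from env import AttrDict
from models import Generator

# Placeholder paths: point these at your vocoder checkpoint and its
# matching config.json (e.g. the Aryan- or Dravidian-tuned generator).
CONFIG_PATH = "hifi-gan/config.json"
CHECKPOINT_PATH = "hifi-gan/generator"

with open(CONFIG_PATH) as f:
    h = AttrDict(json.load(f))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Build the generator, load the trained weights, and switch to inference.
generator = Generator(h).to(device)
state = torch.load(CHECKPOINT_PATH, map_location=device)
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()


def mel_to_wav(mel: torch.Tensor, out_path: str = "output.wav") -> None:
    """Vocode a mel-spectrogram of shape [n_mels, T] into a WAV file."""
    with torch.no_grad():
        # Generator expects a batch dimension: [1, n_mels, T] -> [1, 1, T_wav].
        audio = generator(mel.unsqueeze(0).to(device)).squeeze()
    # Scale float audio to 16-bit PCM (HiFi-GAN's MAX_WAV_VALUE is 32768).
    pcm = (audio * 32768.0).cpu().numpy().astype("int16")
    write(out_path, h.sampling_rate, pcm)
```

After loading a Fastspeech2 model and obtaining a mel tensor from your input text, `mel_to_wav(mel, "male_hindi_output.wav")` would write the synthesized audio at the sampling rate given in the vocoder's `config.json`.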