# vakyansh-tts Text to Speech for Indic languages ## 1. Installation and Setup for training Clone repo Note : for multspeaker glow-tts training use branch [multispeaker](https://github.com/Open-Speech-EkStep/vakyansh-tts/tree/multispeaker) ``` git clone https://github.com/Open-Speech-EkStep/vakyansh-tts ``` Build conda virtual environment ``` cd ./vakyansh-tts conda create --name python=3.7 conda activate pip install -r requirements.txt ``` Install [apex](https://github.com/NVIDIA/apex); commit: 37cdaf4 for Mixed-precision training Note : used only for glow-tts ``` cd .. git clone https://github.com/NVIDIA/apex cd apex git checkout 37cdaf4 pip install -v --disable-pip-version-check --no-cache-dir ./ cd ../vakyansh-tts ``` Build Monotonic Alignment Search Code (Cython) Note : used only for glow-tts ``` bash install.sh ``` ## 2. Data Resampling The data format should have a folder containing all the .wav files for glow-tts and a text file containing filenames with their sentences. Directory structure: langauge_folder_name ``` language_folder_name |-- ./wav/*.wav |-- ./text_file_name.txt ``` The format for text_file_name.txt (Text file is only needed for glow-tts training) ``` ( audio1.wav "Sentence1." ) ( audio2.wav "Sentence2." ) ``` To resample the .wav files to 22050 sample rate, change the following parameters in the vakyansh-tts/scripts/data/resample.sh ``` input_wav_path : absolute path to wav file folder in vakyansh_tts/data/ output_wav_path : absolute path to vakyansh_tts/data/resampled_wav_folder_name output_sample_rate : 22050 (or any other desired sample rate) ``` To run: ```bash cd scripts/data/ bash resample.sh ``` ## 3. Spectogram Training (glow-tts) ### 3.1 Data Preparation To prepare the data edit the vakyansh-tts/scripts/glow/prepare_data.sh file and change the following parameters ``` input_text_path : absolute path to vakyansh_tts/data/text_file_name.txt input_wav_path : absolute path to vakyansh_tts/data/resampled_wav_folder_name gender : female or male voice ``` To run: ```bash cd scripts/glow/ bash prepare_data.sh ``` ### 3.2 Training glow-tts To start the spectogram-training edit the vakyansh-tts/scripts/glow/train_glow.sh file and change the following parameter: ``` gender : female or male voice ``` Make sure that the gender is same as that of the prepare_data.sh file To start the training, run: ```bash cd scripts/glow/ bash train_glow.sh ``` ## 4. Vocoder Training (hifi-gan) ### 4.1 Data Preparation To prepare the data edit the vakyansh-tts/scripts/hifi/prepare_data.sh file and change the following parameters ``` input_wav_path : absolute path to vakyansh_tts/data/resampled_wav_folder_name gender : female or male voice ``` To run: ```bash cd scripts/hifi/ bash prepare_data.sh ``` ### 4.2 Training hifi-gan To start the spectogram-training edit the vakyansh-tts/scripts/hifi/train_hifi.sh file and change the following parameter: ``` gender : female or male voice ``` Make sure that the gender is same as that of the prepare_data.sh file To start the training, run: ```bash cd scripts/hifi/ bash train_hifi.sh ``` ## 5. Inference ### 5.1 Using Gradio To use the gradio link edit the following parameters in the vakyansh-tts/scripts/inference/gradio.sh file: ``` gender : female or male voice device : cpu or cuda lang : langauge code ``` To run: ```bash cd scripts/inference/ bash gradio.sh ``` ### 5.2 Using fast API To use the fast api link edit the parameters in the vakyansh-tts/scripts/inference/api.sh file similar to section 5.1 To run: ```bash cd scripts/inference/ bash api.sh ``` ### 5.3 Direct Inference using text To infer, edit the parameters in the vakyansh-tts/scripts/inference/infer.sh file similar to section 5.1 and set the text to the text variable To run: ```bash cd scripts/inference/ bash infer.sh ``` To configure other parameters there is a version that runs the advanced inference as well. Additional Parameters: ``` noise_scale : can vary from 0 to 1 for noise factor length_scale : can vary from 0 to 2 for changing the speed of the generated audio transliteration : whether to switch on/off transliteration. 1: ON, 0: OFF number_conversion : whether to switch on/off number to words conversion. 1: ON, 0: OFF split_sentences : whether to switch on/off splitting of sentences. 1: ON, 0: OFF ``` To run: ``` cd scripts/inference/ bash advanced_infer.sh ``` ### 5.4 Installation of tts_infer package In tts_infer package, we currently have two components: 1. Transliteration (AI4bharat's open sourced models) (Languages supported: {'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'pa', 'gom', 'mai', 'ml', 'sd', 'si', 'ur'} ) 2. Num to Word (Languages supported: {'en', 'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'or', 'pa'} ) ``` git clone https://github.com/Open-Speech-EkStep/vakyansh-tts cd vakyansh-tts bash install.sh python setup.py bdist_wheel pip install -e . cd tts_infer gsutil -m cp -r gs://vakyaansh-open-models/translit_models . ``` Usage: Refer to example file in tts_infer/ ``` from tts_infer.tts import TextToMel, MelToWav from tts_infer.transliterate import XlitEngine from tts_infer.num_to_word_on_sent import normalize_nums import re from scipy.io.wavfile import write text_to_mel = TextToMel(glow_model_dir='/path/to/glow-tts/checkpoint/dir', device='cuda') mel_to_wav = MelToWav(hifi_model_dir='/path/to/hifi/checkpoint/dir', device='cuda') def translit(text, lang): reg = re.compile(r'[a-zA-Z]') engine = XlitEngine(lang) words = [engine.translit_word(word, topk=1)[lang][0] if reg.match(word) else word for word in text.split()] updated_sent = ' '.join(words) return updated_sent def run_tts(text, lang): text = text.replace('ред', '.') # only for hindi models text_num_to_word = normalize_nums(text, lang) # converting numbers to words in lang text_num_to_word_and_transliterated = translit(text_num_to_word, lang) # transliterating english words to lang mel = text_to_mel.generate_mel(text_num_to_word_and_transliterated) audio, sr = mel_to_wav.generate_wav(mel) write(filename='temp.wav', rate=sr, data=audio) # for saving wav file, if needed return (sr, audio) ```