title: Emotion Aware TTS
emoji: π
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 3.28.1
app_file: app.py
pinned: false
license: mit
This is a PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. This project is based on xcmyz's implementation of FastSpeech. Feel free to use/modify the code.
There are several versions of FastSpeech 2. This implementation is more similar to version 1, which uses F0 values as the pitch features. On the other hand, pitch spectrograms extracted by continuous wavelet transform are used as the pitch features in the later versions.
Dependencies
You can install the Python dependencies with
pip3 install -r requirements.txt
Inference
You have to download the pretrained model and put it in output/ckpt/EmoV_DB/
.
For English multi-speaker Emotional Aware TTS with RoBERTa embeddings,set multi_emotion to True in model.yaml and run
!python3 synthesize.py --text "Example text" --bert_embed 1 --speaker_id SPEAKER_ID --restore_step 900000 --mode single -p config/EmoV_DB/preprocess.yaml -m config/EmoV_DB/model.yaml -t config/EmoV_DB/train.yaml
For English multi-speaker Emotional Aware TTS with multi-emotions, set multi_emotion to True in model.yaml and run
!python3 synthesize.py --text "Example text" --emotion_id EMOTION_ID --speaker_id SPEAKER_ID --restore_step 900000 --mode single -p config/EmoV_DB/preprocess.yaml -m config/EmoV_DB/model.yaml -t config/EmoV_DB/train.yaml
The generated utterances will be put in output/result/
.
Controllability
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20 % and decrease the volume by 20 % by
!python3 synthesize.py --text "Example text" --emotion_id EMOTION_ID --speaker_id SPEAKER_ID --restore_step 900000 --mode single -p config/EmoV_DB/preprocess.yaml -m config/EmoV_DB/model.yaml -t config/EmoV_DB/train.yaml --duration_control 0.8 --energy_control 0.8
Training
Datasets
The supported datasets are
- EmoV_DB: a multi-speaker Emotional Voices English dataset consists of four speakers, two females and two males. The emotions expressed in the recordings are amusement, anger, sleepiness, and disgust.
Preprocessing and Training
Follow the steps in Instructions
Implementation Issues
- Following xcmyz's implementation, I use an additional Tacotron-2-styled Post-Net after the decoder, which is not used in the original FastSpeech 2.
- Gradient clipping is used in the training.
- In my experience, using phoneme-level pitch and energy prediction instead of frame-level prediction results in much better prosody, and normalizing the pitch and energy features also helps. Please refer to
config/README.md
for more details.
Please inform me if you find any mistakes in this repo, or any useful tips to train the FastSpeech 2 model.
References
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren, et al.
- Text aware Emotional Text-to-speech with BERT
- xcmyz's FastSpeech implementation
- TensorSpeech's FastSpeech 2 implementation
- rishikksh20's FastSpeech 2 implementation
Citation
@INPROCEEDINGS{chien2021investigating,
author={Chien, Chung-Ming and Lin, Jheng-Hao and Huang, Chien-yu and Hsu, Po-chun and Lee, Hung-yi},
booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech},
year={2021},
volume={},
number={},
pages={8588-8592},
doi={10.1109/ICASSP39728.2021.9413880}}