# Config
Here are the config files used to train the single/multi-speaker TTS models.
Four different configurations are given:
- LJSpeech: suggested configuration for the LJSpeech dataset.
- LibriTTS: suggested configuration for the LibriTTS dataset.
- AISHELL3: suggested configuration for the AISHELL-3 dataset.
- LJSpeech_paper: close to the setting proposed in the original FastSpeech 2 paper.

Some important hyper-parameters are explained here.
## preprocess.yaml
- **path.lexicon_path**: the lexicon (which maps words to phonemes) used by the Montreal Forced Aligner.
We provide an English lexicon and a Mandarin lexicon.
Erhua (兒化音) is handled in the Mandarin lexicon.
- **mel.stft.mel_fmax**: set it to 8000 if the HiFi-GAN vocoder is used, and to null if MelGAN is used.
- **pitch.feature & energy.feature**: the original paper proposed to predict frame-level pitch and energy features and apply them to the inputs of the TTS decoder to control the pitch and energy of the synthesized utterances.
However, in our experiments, we find that using phoneme-level features makes the prosody of the synthesized utterances more natural.
- **pitch.normalization & energy.normalization**: whether to normalize the pitch and energy values.
The original paper did not normalize these values.
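To make the phoneme-level features and normalization above concrete, here is a minimal sketch that averages frame-level pitch over phoneme durations and applies z-score normalization. The function names and numbers are illustrative, not the repo's actual preprocessing code.

```python
import numpy as np

def phoneme_level_pitch(frame_pitch, durations):
    """Average frame-level pitch over each phoneme's duration,
    producing one value per phoneme."""
    values, start = [], 0
    for d in durations:
        seg = frame_pitch[start:start + d]
        # Average only over voiced (non-zero) frames.
        voiced = seg[seg > 0]
        values.append(voiced.mean() if len(voiced) else 0.0)
        start += d
    return np.array(values)

def normalize(values, mean, std):
    """Z-score normalization using dataset-level statistics."""
    return (values - mean) / std

# Two phonemes of three frames each; zeros mark unvoiced frames.
frame_pitch = np.array([0.0, 110.0, 112.0, 220.0, 218.0, 0.0])
durations = [3, 3]
pitch = phoneme_level_pitch(frame_pitch, durations)
```

The `pitch` array here has one entry per phoneme, which is what the variance predictor is trained against when phoneme-level features are used.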
## train.yaml
- **optimizer.grad_acc_step**: the number of batches over which gradients are accumulated before updating the model parameters and calling optimizer.zero_grad(). This is useful if you wish to train the model with a large batch size but do not have sufficient GPU memory.
- **optimizer.anneal_steps & optimizer.anneal_rate**: the learning rate is reduced at each of the **anneal_steps** by the ratio specified by **anneal_rate**.
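The interaction of the two settings above can be sketched in plain Python, with a manually accumulated gradient standing in for a real optimizer. All values here are tiny and illustrative, not the defaults from train.yaml.

```python
# Illustrative values; in train.yaml these are much larger.
grad_acc_step = 4          # accumulate gradients over 4 batches
anneal_steps = [2, 4]      # update steps at which the LR is reduced
anneal_rate = 0.3
base_lr = 1.0

def annealed_lr(step):
    """Multiply the base LR by anneal_rate at each anneal step reached."""
    lr = base_lr
    for s in anneal_steps:
        if step >= s:
            lr *= anneal_rate
    return lr

weight = 0.0
grad_buffer = 0.0
update_step = 0

# Each "batch" contributes gradient 1.0; dividing by grad_acc_step makes
# the accumulated update equivalent to one large-batch step.
for i in range(16):
    grad_buffer += 1.0 / grad_acc_step
    if (i + 1) % grad_acc_step == 0:      # time to update the parameters
        update_step += 1
        weight -= annealed_lr(update_step) * grad_buffer
        grad_buffer = 0.0                 # the optimizer.zero_grad() moment
```

Sixteen batches thus yield only four parameter updates, and the effective step size shrinks as each anneal step is passed.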
## model.yaml
- **transformer.decoder_layer**: the original paper used a 4-layer decoder, but we find it better to use a 6-layer decoder, especially for multi-speaker TTS.
- **variance_embedding.pitch_quantization**: when the pitch values are normalized as specified in ``preprocess.yaml``, it is not valid to use the log-scale quantization bins proposed in the original paper, since normalized values can be zero or negative, so we use linear-scale bins instead.
- **multi_speaker**: whether to apply a speaker embedding table to enable multi-speaker TTS.
- **vocoder.speaker**: should be set to 'universal' if any dataset other than LJSpeech is used.
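The linear vs. log quantization bins mentioned under **variance_embedding.pitch_quantization** can be sketched as follows; the bin count and value ranges are illustrative, not the repo's exact numbers.

```python
import numpy as np

n_bins = 256                        # size of the pitch embedding table
pitch_min, pitch_max = -4.0, 4.0    # normalized pitch is roughly in [-4, 4]

# Linear-scale bin edges are valid for normalized (possibly negative) values.
linear_bins = np.linspace(pitch_min, pitch_max, n_bins - 1)

# Log-scale edges, as in the original paper, require strictly positive
# values, so they only make sense for unnormalized pitch in Hz.
hz_min, hz_max = 80.0, 400.0
log_bins = np.exp(np.linspace(np.log(hz_min), np.log(hz_max), n_bins - 1))

# np.digitize maps each value to the bin index used to look up its embedding.
normalized_pitch = np.array([-1.2, 0.0, 2.5])
indices = np.digitize(normalized_pitch, linear_bins)
```

Trying to build the log-scale edges over a range containing zero or negative values would fail (log of a non-positive number), which is exactly why normalization forces the switch to linear-scale bins.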