LJSpeech

LJSpeech is a public domain TTS corpus with around 24 hours of English speech sampled at 22.05kHz. We provide examples for building Transformer and FastSpeech 2 models on this dataset.

Data preparation

Download data, create splits and generate audio manifests with

python -m examples.speech_synthesis.preprocessing.get_ljspeech_audio_manifest \
  --output-data-root ${AUDIO_DATA_ROOT} \
  --output-manifest-root ${AUDIO_MANIFEST_ROOT}
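
Throughout this document, commands refer to a few root directories via environment variables. A minimal setup sketch, using placeholder paths of our own choosing:

# Placeholder locations; pick any writable directories.
export AUDIO_DATA_ROOT=~/data/ljspeech/audio
export AUDIO_MANIFEST_ROOT=~/data/ljspeech/audio_manifest
export FEATURE_MANIFEST_ROOT=~/data/ljspeech/feature_manifest
export SAVE_DIR=~/checkpoints/ljspeech_tts
export EVAL_OUTPUT_ROOT=~/eval/ljspeech_tts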

Then, extract log-Mel spectrograms, generate feature manifest and create data configuration YAML with

python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
  --audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
  --output-root ${FEATURE_MANIFEST_ROOT} \
  --ipa-vocab --use-g2p

where we use phoneme inputs (--ipa-vocab --use-g2p) as an example.

FastSpeech 2 additionally requires frame durations, pitch and energy as auxiliary training targets. Add --add-fastspeech-targets to include these fields in the feature manifests. Frame durations can be obtained either from phoneme-level forced alignment or from a frame-level pseudo-text unit sequence. They should be pre-computed and specified via one of the following options (see the sketch after this list):

  • --textgrid-zip ${TEXT_GRID_ZIP_PATH} for a ZIP file that contains one TextGrid file per sample with forced-alignment info.
  • --id-to-units-tsv ${ID_TO_UNIT_TSV} for a TSV file with two columns: sample ID and a space-delimited pseudo-text unit sequence.
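
For example, a feature-extraction run that also emits the FastSpeech 2 targets could combine the flags above as follows (a sketch assuming forced-alignment TextGrids; swap in --id-to-units-tsv ${ID_TO_UNIT_TSV} to use pseudo-text units instead):

python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
  --audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
  --output-root ${FEATURE_MANIFEST_ROOT} \
  --ipa-vocab --use-g2p \
  --add-fastspeech-targets \
  --textgrid-zip ${TEXT_GRID_ZIP_PATH}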

For your convenience, we provide pre-computed forced alignments from the Montreal Forced Aligner and pseudo-text units from HuBERT. You can also generate them yourself with different software or models.

Training

Transformer

fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
  --config-yaml config.yaml --train-subset train --valid-subset dev \
  --num-workers 4 --max-tokens 30000 --max-update 200000 \
  --task text_to_speech --criterion tacotron2 --arch tts_transformer \
  --clip-norm 5.0 --n-frames-per-step 4 --bce-pos-weight 5.0 \
  --dropout 0.1 --attention-dropout 0.1 --activation-dropout 0.1 \
  --encoder-normalize-before --decoder-normalize-before \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss

where SAVE_DIR is the checkpoint root path. We set --update-freq 8 to simulate 8 GPUs with a single GPU; adjust it accordingly when training on more than 1 GPU (see the sketch below).
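
As a rough guide, the effective batch size scales with the number of GPUs times --update-freq, so with N GPUs you can divide the single-GPU value of 8 by N. A minimal sketch, where NUM_GPUS is a placeholder of our own:

NUM_GPUS=4                       # e.g. training on 4 GPUs
UPDATE_FREQ=$(( 8 / NUM_GPUS ))  # 8 is the single-GPU setting above -> 2
echo "pass --update-freq ${UPDATE_FREQ} to fairseq-train"

Then replace --update-freq 8 in the command above with the computed value.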

FastSpeech 2

fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
  --config-yaml config.yaml --train-subset train --valid-subset dev \
  --num-workers 4 --max-sentences 6 --max-update 200000 \
  --task text_to_speech --criterion fastspeech2 --arch fastspeech2 \
  --clip-norm 5.0 --n-frames-per-step 1 \
  --dropout 0.1 --attention-dropout 0.1 --activation-dropout 0.1 \
  --encoder-normalize-before --decoder-normalize-before \
  --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss

Inference

Average the last 5 checkpoints, then generate spectrograms and waveforms for the test split using the default Griffin-Lim vocoder:

SPLIT=test
CHECKPOINT_NAME=avg_last_5
CHECKPOINT_PATH=${SAVE_DIR}/checkpoint_${CHECKPOINT_NAME}.pt
python scripts/average_checkpoints.py --inputs ${SAVE_DIR} \
  --num-epoch-checkpoints 5 \
  --output ${CHECKPOINT_PATH}

python -m examples.speech_synthesis.generate_waveform ${FEATURE_MANIFEST_ROOT} \
  --config-yaml config.yaml --gen-subset ${SPLIT} --task text_to_speech \
  --path ${CHECKPOINT_PATH} --max-tokens 50000 --spec-bwd-max-iter 32 \
  --dump-waveforms

which dumps files (waveforms, features, attention plots, etc.) to ${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT}. To also re-synthesize the target waveforms for automatic evaluation, add --dump-target (see the sketch below).
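
For instance, the generation call above can be extended to re-synthesize the reference waveforms in the same pass (a sketch; same arguments as before, plus --dump-target):

python -m examples.speech_synthesis.generate_waveform ${FEATURE_MANIFEST_ROOT} \
  --config-yaml config.yaml --gen-subset ${SPLIT} --task text_to_speech \
  --path ${CHECKPOINT_PATH} --max-tokens 50000 --spec-bwd-max-iter 32 \
  --dump-waveforms --dump-target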

Automatic Evaluation

To start with, generate the manifest for the synthetic speech, which will be taken as input by the evaluation scripts.

python -m examples.speech_synthesis.evaluation.get_eval_manifest \
  --generation-root ${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT} \
  --audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
  --output-path ${EVAL_OUTPUT_ROOT}/eval.tsv \
  --vocoder griffin_lim --sample-rate 22050 --audio-format flac \
  --use-resynthesized-target

Speech recognition (ASR) models usually operate at lower sample rates (e.g. 16kHz). For the WER/CER metric, you may need to resample the audio accordingly: add --output-sample-rate 16000 to generate_waveform.py and use --sample-rate 16000 for get_eval_manifest.py (see the sketch below).
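
For example, after re-running the generation step with --output-sample-rate 16000 added, the 16kHz evaluation manifest could be built like this (eval_16khz.tsv is our own naming choice, matching the manifest used by the WER/CER command below):

# Assumes waveforms were (re)generated with --output-sample-rate 16000.
python -m examples.speech_synthesis.evaluation.get_eval_manifest \
  --generation-root ${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT} \
  --audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
  --output-path ${EVAL_OUTPUT_ROOT}/eval_16khz.tsv \
  --vocoder griffin_lim --sample-rate 16000 --audio-format flac \
  --use-resynthesized-target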

WER/CER metric

We use a wav2vec 2.0 ASR model as an example. Download the model checkpoint and dictionary, then compute WER/CER with

python -m examples.speech_synthesis.evaluation.eval_asr \
  --audio-header syn --text-header text --err-unit char --split ${SPLIT} \
  --w2v-ckpt ${WAV2VEC2_CHECKPOINT_PATH} --w2v-dict-dir ${WAV2VEC2_DICT_DIR} \
  --raw-manifest ${EVAL_OUTPUT_ROOT}/eval_16khz.tsv --asr-dir ${EVAL_OUTPUT_ROOT}/asr

MCD/MSD metric

python -m examples.speech_synthesis.evaluation.eval_sp \
  ${EVAL_OUTPUT_ROOT}/eval.tsv --mcd --msd

F0 metrics

python -m examples.speech_synthesis.evaluation.eval_f0 \
  ${EVAL_OUTPUT_ROOT}/eval.tsv --gpe --vde --ffe

Results

--arch            Params   Test MCD   Model
tts_transformer   54M      3.8        Download
fastspeech2       41M      3.8        Download