# LJSpeech
LJSpeech is a public domain TTS corpus with around 24 hours of English speech sampled at 22.05kHz.
We provide examples for building Transformer and FastSpeech 2 models on this dataset.
## Data preparation
Download the data, create splits and generate audio manifests with

    python -m examples.speech_synthesis.preprocessing.get_ljspeech_audio_manifest \
      --output-data-root ${AUDIO_DATA_ROOT} \
      --output-manifest-root ${AUDIO_MANIFEST_ROOT}

Then, extract log-Mel spectrograms, generate the feature manifest and create the data configuration YAML with

    python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
      --audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
      --output-root ${FEATURE_MANIFEST_ROOT} \
      --ipa-vocab --use-g2p

where we use phoneme inputs (`--ipa-vocab --use-g2p`) as an example.

FastSpeech 2 additionally requires frame durations, pitch and energy as auxiliary training targets.
Add `--add-fastspeech-targets` to include these fields in the feature manifests. Frame durations come from either
phoneme-level forced alignment or a frame-level pseudo-text unit sequence. They must be pre-computed and specified via:

- `--textgrid-zip ${TEXT_GRID_ZIP_PATH}` for a ZIP file containing one
  TextGrid file per sample, which provides the forced-alignment information.
- `--id-to-units-tsv ${ID_TO_UNIT_TSV}` for a TSV file with 2 columns: the sample ID and the
  space-delimited pseudo-text unit sequence.

For your convenience, we provide pre-computed forced alignments from Montreal Forced Aligner and
pseudo-text units from HuBERT. You can also generate them yourself using different software or a
different model.
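
As a rough illustration of the second option, the sketch below parses two-column rows (sample ID, space-delimited units) and collapses consecutive repeated units into (unit, frame count) pairs. The collapsing rule, the absence of a header row, the sample ID and the helper names `read_id_to_units` and `units_to_durations` are all assumptions made for illustration; they are not taken from the preprocessing script.

```python
import io
from itertools import groupby


def read_id_to_units(lines):
    """Parse 'sample_id<TAB>unit unit ...' rows into {sample_id: [units]}.

    Assumes there is no header row; skip the first line if your TSV has one.
    """
    id_to_units = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        sample_id, units = line.split("\t", 1)
        id_to_units[sample_id] = units.split()
    return id_to_units


def units_to_durations(units):
    """Collapse runs of identical frame-level units into (unit, n_frames) pairs."""
    return [(unit, sum(1 for _ in run)) for unit, run in groupby(units)]


# Hypothetical sample: 7 frames forming 3 runs.
tsv = io.StringIO("LJ001-0001\t12 12 12 7 7 31 31\n")
id_to_units = read_id_to_units(tsv)
durations = units_to_durations(id_to_units["LJ001-0001"])
print(durations)  # [('12', 3), ('7', 2), ('31', 2)]
```

The frame counts sum to the number of frames in the input sequence, which is a quick sanity check for any duration target derived this way.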
## Training
#### Transformer
    fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
      --config-yaml config.yaml --train-subset train --valid-subset dev \
      --num-workers 4 --max-tokens 30000 --max-update 200000 \
      --task text_to_speech --criterion tacotron2 --arch tts_transformer \
      --clip-norm 5.0 --n-frames-per-step 4 --bce-pos-weight 5.0 \
      --dropout 0.1 --attention-dropout 0.1 --activation-dropout 0.1 \
      --encoder-normalize-before --decoder-normalize-before \
      --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
      --seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss

where `SAVE_DIR` is the checkpoint root path. We set `--update-freq 8` to simulate training on 8 GPUs with a single GPU.
Reduce it accordingly when using more than 1 GPU.
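
The relationship between `--update-freq` and the GPU count can be sketched as follows. The helper `update_freq_for` is hypothetical, and it assumes the goal is to keep the product of GPU count and update frequency equal to the 8-GPU setup that the command simulates.

```python
def update_freq_for(num_gpus, simulated_gpus=8):
    """Return the --update-freq value so that num_gpus * update_freq == simulated_gpus."""
    if num_gpus < 1 or simulated_gpus % num_gpus != 0:
        raise ValueError("num_gpus must be a positive divisor of simulated_gpus")
    return simulated_gpus // num_gpus


for n in (1, 2, 4, 8):
    print(f"{n} GPU(s): --update-freq {update_freq_for(n)}")
```

With 1 GPU this gives the value used in the command (8); with 8 GPUs gradient accumulation is no longer needed (1).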
#### FastSpeech2
    fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
      --config-yaml config.yaml --train-subset train --valid-subset dev \
      --num-workers 4 --max-sentences 6 --max-update 200000 \
      --task text_to_speech --criterion fastspeech2 --arch fastspeech2 \
      --clip-norm 5.0 --n-frames-per-step 1 \
      --dropout 0.1 --attention-dropout 0.1 \
      --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
      --seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss

## Inference
Average the last 5 checkpoints, then generate the test split spectrogram and waveform using the default Griffin-Lim vocoder:

    python scripts/ --inputs ${SAVE_DIR} \
      --num-epoch-checkpoints 5

    python -m examples.speech_synthesis.generate_waveform ${FEATURE_MANIFEST_ROOT} \
      --config-yaml config.yaml --gen-subset ${SPLIT} --task text_to_speech \
      --path ${CHECKPOINT_PATH} --max-tokens 50000 --spec-bwd-max-iter 32

which dumps files (waveform, feature, attention plot, etc.) to `${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT}`. To
re-synthesize target waveforms for automatic evaluation, add `--dump-target`.
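
Checkpoint averaging amounts to an element-wise mean over the parameters of the selected checkpoints. The sketch below shows the idea on plain Python dictionaries of float lists. It is not the averaging script itself; the `average_params` helper and the toy parameter names are made up for illustration.

```python
def average_params(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    All checkpoints are expected to share the same names and lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise KeyError("checkpoints have different parameter names")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }


# Toy stand-ins for the last 5 epoch checkpoints.
last_five = [{"w": [float(i), 2.0 * i], "b": [1.0]} for i in range(1, 6)]
averaged = average_params(last_five)
print(averaged)  # {'w': [3.0, 6.0], 'b': [1.0]}
```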
## Automatic Evaluation
To start with, generate the manifest for synthetic speech, which the evaluation scripts take as input:

    python -m examples.speech_synthesis.evaluation.get_eval_manifest \
      --generation-root ${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT} \
      --audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
      --output-path ${EVAL_OUTPUT_ROOT}/eval.tsv \
      --vocoder griffin_lim --sample-rate 22050 --audio-format flac

Speech recognition (ASR) models usually operate at lower sample rates (e.g. 16kHz). For the WER/CER metric,
you may need to resample the audio accordingly: add `--output-sample-rate 16000` for `generate_waveform` and
use `--sample-rate 16000` for `get_eval_manifest`.
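
Going from 22.05kHz to 16kHz is a rational rate change. The sketch below only does the arithmetic (reduced up/down factors and the expected sample count); it does not resample audio, and the helper names are illustrative rather than part of the toolkit.

```python
from math import gcd


def resample_ratio(src_rate, dst_rate):
    """Return (up, down) with dst_rate / src_rate == up / down in lowest terms."""
    g = gcd(src_rate, dst_rate)
    return dst_rate // g, src_rate // g


def resampled_length(n_samples, src_rate, dst_rate):
    """Sample count after resampling, rounded up to a whole sample."""
    up, down = resample_ratio(src_rate, dst_rate)
    return (n_samples * up + down - 1) // down


up, down = resample_ratio(22050, 16000)
print(up, down)  # 320 441
print(resampled_length(22050, 22050, 16000))  # 16000, i.e. one second of audio
```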
#### WER/CER metric
We use a wav2vec 2.0 ASR model as an example. Download the model checkpoint and dictionary, then compute WER/CER with

    python -m examples.speech_synthesis.evaluation.eval_asr \
      --audio-header syn --text-header text --err-unit char --split ${SPLIT} \
      --w2v-ckpt ${WAV2VEC2_CHECKPOINT_PATH} --w2v-dict-dir ${WAV2VEC2_DICT_DIR} \
      --raw-manifest ${EVAL_OUTPUT_ROOT}/eval_16khz.tsv --asr-dir ${EVAL_OUTPUT_ROOT}/asr

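
For orientation, WER and CER are both edit distances normalised by the reference length, computed over words and characters respectively. The sketch below is a minimal stand-alone version of that definition. It is not the evaluation script, it applies no text normalisation, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, unit="char"):
    """CER for unit='char', WER for unit='word'."""
    if unit == "char":
        ref_units, hyp_units = list(ref), list(hyp)
    elif unit == "word":
        ref_units, hyp_units = ref.split(), hyp.split()
    else:
        raise ValueError("unit must be 'char' or 'word'")
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


# One substituted character out of 11, and one substituted word out of 3.
cer = error_rate("the cat sat", "the cat sit", unit="char")
wer = error_rate("the cat sat", "the cat sit", unit="word")
print(f"CER={cer:.3f} WER={wer:.3f}")
```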
#### MCD/MSD metric
    python -m examples.speech_synthesis.evaluation.eval_sp \
      ${EVAL_OUTPUT_ROOT}/eval.tsv --mcd --msd

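
For reference, mel-cepstral distortion (MCD) between two aligned frames is commonly defined as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), taken over the cepstral coefficients with the 0th one excluded. The sketch below implements that textbook formula on plain lists. How the evaluation script extracts features and aligns frames is not shown here, and the function names are illustrative.

```python
import math


def frame_mcd(ref_mcep, syn_mcep):
    """MCD in dB for one pair of aligned frames.

    Inputs are mel-cepstral coefficient vectors with the 0th (energy)
    coefficient already removed.
    """
    if len(ref_mcep) != len(syn_mcep):
        raise ValueError("coefficient vectors must have the same length")
    squared_diff = sum((r - s) ** 2 for r, s in zip(ref_mcep, syn_mcep))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_diff)


def mean_mcd(ref_frames, syn_frames):
    """Average frame-level MCD over two already time-aligned sequences."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("sequences must be non-empty and equally long")
    total = sum(frame_mcd(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)


# Toy 3-dimensional cepstra for two frames.
ref = [[1.0, 0.5, -0.2], [0.8, 0.1, 0.0]]
syn = [[0.9, 0.4, -0.1], [0.8, 0.1, 0.0]]
print(f"{mean_mcd(ref, syn):.3f} dB")
```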
#### F0 metrics
    python -m examples.speech_synthesis.evaluation.eval_f0 \
      ${EVAL_OUTPUT_ROOT}/eval.tsv --gpe --vde --ffe

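
The three flags correspond to gross pitch error (GPE), voicing decision error (VDE) and F0 frame error (FFE). The sketch below follows their usual definitions on two aligned F0 tracks, where 0 marks an unvoiced frame. The 20% gross-error threshold is the commonly used value and is an assumption here, as is the function name; the evaluation script may differ in detail.

```python
def f0_metrics(ref_f0, syn_f0, gross_threshold=0.2):
    """Return (gpe, vde, ffe) for two aligned F0 tracks; 0 marks an unvoiced frame."""
    if len(ref_f0) != len(syn_f0) or not ref_f0:
        raise ValueError("tracks must be non-empty and equally long")
    n_frames = len(ref_f0)
    both_voiced = gross = voicing_errors = 0
    for r, s in zip(ref_f0, syn_f0):
        r_voiced, s_voiced = r > 0, s > 0
        if r_voiced != s_voiced:
            voicing_errors += 1          # voiced/unvoiced decision differs
        elif r_voiced:
            both_voiced += 1
            if abs(s - r) > gross_threshold * r:
                gross += 1               # pitch off by more than the threshold
    gpe = gross / both_voiced if both_voiced else 0.0  # over frames voiced in both
    vde = voicing_errors / n_frames                    # over all frames
    ffe = (gross + voicing_errors) / n_frames          # either kind of error
    return gpe, vde, ffe


ref = [100.0, 100.0, 0.0, 0.0, 200.0]
syn = [105.0, 150.0, 0.0, 90.0, 0.0]
gpe, vde, ffe = f0_metrics(ref, syn)
print(f"GPE={gpe:.2f} VDE={vde:.2f} FFE={ffe:.2f}")  # GPE=0.50 VDE=0.40 FFE=0.60
```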
## Results
| --arch | Params | Test MCD | Model |
|---|---|---|---|
| tts_transformer | 54M | 3.8 | Download |
| fastspeech2 | 41M | 3.8 | Download |