[[Back]](..)
# LJSpeech
[LJSpeech](https://keithito.com/LJ-Speech-Dataset) is a public domain TTS
corpus with around 24 hours of English speech sampled at 22.05kHz. We provide examples for building
[Transformer](https://arxiv.org/abs/1809.08895) and [FastSpeech 2](https://arxiv.org/abs/2006.04558)
models on this dataset.
## Data preparation
Download data, create splits and generate audio manifests with
```bash
python -m examples.speech_synthesis.preprocessing.get_ljspeech_audio_manifest \
--output-data-root ${AUDIO_DATA_ROOT} \
--output-manifest-root ${AUDIO_MANIFEST_ROOT}
```
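The `${...}` variables used throughout this page are placeholders for paths of your choice. As a minimal sketch (the directory names below are illustrative, not required):
```bash
# Illustrative directory layout; adjust to your own storage.
export AUDIO_DATA_ROOT=~/data/ljspeech/audio
export AUDIO_MANIFEST_ROOT=~/data/ljspeech/audio_manifest
export FEATURE_MANIFEST_ROOT=~/data/ljspeech/feature_manifest
export SAVE_DIR=~/checkpoints/ljspeech_tts
export EVAL_OUTPUT_ROOT=~/eval/ljspeech_tts
```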
Then, extract log-Mel spectrograms, generate feature manifest and create data configuration YAML with
```bash
python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
--audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
--output-root ${FEATURE_MANIFEST_ROOT} \
--ipa-vocab --use-g2p
```
where we use phoneme inputs (`--ipa-vocab --use-g2p`) as an example.
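After this step, `${FEATURE_MANIFEST_ROOT}` should contain the data configuration YAML referenced as `config.yaml` in the commands below, along with per-split feature manifests. A quick sanity check (the manifest file names here are an assumption; verify against the actual output):
```bash
# The training/inference commands below expect config.yaml plus
# train/dev/test manifests here (names assumed; check the actual output).
ls ${FEATURE_MANIFEST_ROOT}
head -n 2 ${FEATURE_MANIFEST_ROOT}/train.tsv
```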
FastSpeech 2 additionally requires frame durations, pitch and energy as auxiliary training targets.
Add `--add-fastspeech-targets` to include these fields in the feature manifests. Frame durations can be obtained
either from phoneme-level forced alignment or from frame-level pseudo-text unit sequences. They should be
pre-computed and specified via one of the following:
- `--textgrid-zip ${TEXT_GRID_ZIP_PATH}` for a ZIP file containing one
[TextGrid](https://www.fon.hum.uva.nl/praat/manual/TextGrid.html) file per sample with the forced-alignment info.
- `--id-to-units-tsv ${ID_TO_UNIT_TSV}` for a TSV file with two columns: the sample ID and the
space-delimited pseudo-text unit sequence (see the format sketch below).
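As a format sketch only (the IDs and unit values below are made up, and a header row may or may not be present in the provided file):
```bash
# Hypothetical layout of ${ID_TO_UNIT_TSV}: tab between the two columns,
# spaces between units.
#   LJ001-0001    71 71 12 12 12 54 ...
#   LJ001-0002    33 33 8 101 101 9 ...
head -n 2 ${ID_TO_UNIT_TSV}
```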
For your convenience, we provide pre-computed
[forced alignments](https://dl.fbaipublicfiles.com/fairseq/s2/ljspeech_mfa.zip) from the
[Montreal Forced Aligner](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) and
[pseudo-text units](https://dl.fbaipublicfiles.com/fairseq/s2/ljspeech_hubert.tsv) from
[HuBERT](https://github.com/pytorch/fairseq/tree/main/examples/hubert). You can also generate them yourself
with other aligners or models.
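For example, to use the provided MFA alignments one could download the ZIP and re-run the feature manifest step with the FastSpeech 2 flags (a sketch; the download location is illustrative):
```bash
# Fetch the pre-computed TextGrid archive (download path is illustrative).
wget -P ${AUDIO_DATA_ROOT} https://dl.fbaipublicfiles.com/fairseq/s2/ljspeech_mfa.zip

# Re-generate feature manifests with FastSpeech 2 auxiliary targets.
python -m examples.speech_synthesis.preprocessing.get_feature_manifest \
  --audio-manifest-root ${AUDIO_MANIFEST_ROOT} \
  --output-root ${FEATURE_MANIFEST_ROOT} \
  --ipa-vocab --use-g2p \
  --add-fastspeech-targets \
  --textgrid-zip ${AUDIO_DATA_ROOT}/ljspeech_mfa.zip
```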
## Training
#### Transformer
```bash
fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
--config-yaml config.yaml --train-subset train --valid-subset dev \
--num-workers 4 --max-tokens 30000 --max-update 200000 \
--task text_to_speech --criterion tacotron2 --arch tts_transformer \
--clip-norm 5.0 --n-frames-per-step 4 --bce-pos-weight 5.0 \
--dropout 0.1 --attention-dropout 0.1 --activation-dropout 0.1 \
--encoder-normalize-before --decoder-normalize-before \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss
```
where `SAVE_DIR` is the checkpoint root path. We set `--update-freq 8` to simulate 8 GPUs on a single GPU. Adjust it
accordingly when training on more than 1 GPU, as sketched below.
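A simple way to pick the value is to keep the product of GPU count and `--update-freq` at 8, which preserves the effective batch size of the recipe above (a minimal sketch):
```bash
# Keep (number of GPUs) x (--update-freq) = 8 to preserve the effective batch size.
NUM_GPUS=4
UPDATE_FREQ=$((8 / NUM_GPUS))   # = 2 for 4 GPUs
echo "pass --update-freq ${UPDATE_FREQ} when training on ${NUM_GPUS} GPUs"
```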
#### FastSpeech 2
```bash
fairseq-train ${FEATURE_MANIFEST_ROOT} --save-dir ${SAVE_DIR} \
--config-yaml config.yaml --train-subset train --valid-subset dev \
--num-workers 4 --max-sentences 6 --max-update 200000 \
--task text_to_speech --criterion fastspeech2 --arch fastspeech2 \
--clip-norm 5.0 --n-frames-per-step 1 \
--dropout 0.1 --attention-dropout 0.1 --activation-dropout 0.1 \
--encoder-normalize-before --decoder-normalize-before \
--optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--seed 1 --update-freq 8 --eval-inference --best-checkpoint-metric mcd_loss
```
## Inference
Average the last 5 checkpoints, then generate spectrograms and waveforms for the test split using the default
Griffin-Lim vocoder:
```bash
SPLIT=test
CHECKPOINT_NAME=avg_last_5
CHECKPOINT_PATH=${SAVE_DIR}/checkpoint_${CHECKPOINT_NAME}.pt
python scripts/average_checkpoints.py --inputs ${SAVE_DIR} \
--num-epoch-checkpoints 5 \
--output ${CHECKPOINT_PATH}
python -m examples.speech_synthesis.generate_waveform ${FEATURE_MANIFEST_ROOT} \
--config-yaml config.yaml --gen-subset ${SPLIT} --task text_to_speech \
--path ${CHECKPOINT_PATH} --max-tokens 50000 --spec-bwd-max-iter 32 \
--dump-waveforms
```
which dumps files (waveform, feature, attention plot, etc.) to `${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT}`. To
re-synthesize target waveforms for automatic evaluation, add `--dump-target`.
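If you plan to run the automatic evaluation below with `--use-resynthesized-target`, generate the re-synthesized targets up front; a sketch of the same generation command with `--dump-target` appended:
```bash
# Same generation command as above, additionally re-synthesizing target
# waveforms for automatic evaluation.
python -m examples.speech_synthesis.generate_waveform ${FEATURE_MANIFEST_ROOT} \
  --config-yaml config.yaml --gen-subset ${SPLIT} --task text_to_speech \
  --path ${CHECKPOINT_PATH} --max-tokens 50000 --spec-bwd-max-iter 32 \
  --dump-waveforms --dump-target
```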
## Automatic Evaluation
To start with, generate the manifest for the synthetic speech, which will be used as input by the evaluation scripts.
```bash
python -m examples.speech_synthesis.evaluation.get_eval_manifest \
--generation-root ${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT} \
--audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
--output-path ${EVAL_OUTPUT_ROOT}/eval.tsv \
--vocoder griffin_lim --sample-rate 22050 --audio-format flac \
--use-resynthesized-target
```
Speech recognition (ASR) models usually operate at lower sample rates (e.g. 16kHz). For the WER/CER metrics,
you may need to resample the audio accordingly: add `--output-sample-rate 16000` to `generate_waveform.py` and
use `--sample-rate 16000` for `get_eval_manifest.py`, as sketched below.
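Concretely, the 16kHz manifest consumed by the ASR step below (named `eval_16khz.tsv` there; the name is just a convention) could be produced with:
```bash
# Re-generate waveforms at 16kHz for ASR-based evaluation.
python -m examples.speech_synthesis.generate_waveform ${FEATURE_MANIFEST_ROOT} \
  --config-yaml config.yaml --gen-subset ${SPLIT} --task text_to_speech \
  --path ${CHECKPOINT_PATH} --max-tokens 50000 --spec-bwd-max-iter 32 \
  --dump-waveforms --dump-target --output-sample-rate 16000

# Build the matching evaluation manifest at 16kHz.
python -m examples.speech_synthesis.evaluation.get_eval_manifest \
  --generation-root ${SAVE_DIR}/generate-${CHECKPOINT_NAME}-${SPLIT} \
  --audio-manifest ${AUDIO_MANIFEST_ROOT}/${SPLIT}.audio.tsv \
  --output-path ${EVAL_OUTPUT_ROOT}/eval_16khz.tsv \
  --vocoder griffin_lim --sample-rate 16000 --audio-format flac \
  --use-resynthesized-target
```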
#### WER/CER metric
We use a wav2vec 2.0 ASR model as an example. [Download](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec)
the model checkpoint and dictionary, then compute WER/CER with
```bash
python -m examples.speech_synthesis.evaluation.eval_asr \
--audio-header syn --text-header text --err-unit char --split ${SPLIT} \
--w2v-ckpt ${WAV2VEC2_CHECKPOINT_PATH} --w2v-dict-dir ${WAV2VEC2_DICT_DIR} \
--raw-manifest ${EVAL_OUTPUT_ROOT}/eval_16khz.tsv --asr-dir ${EVAL_OUTPUT_ROOT}/asr
```
#### MCD/MSD metric
```bash
python -m examples.speech_synthesis.evaluation.eval_sp \
${EVAL_OUTPUT_ROOT}/eval.tsv --mcd --msd
```
#### F0 metrics
```bash
python -m examples.speech_synthesis.evaluation.eval_f0 \
${EVAL_OUTPUT_ROOT}/eval.tsv --gpe --vde --ffe
```
## Results
| `--arch` | Params | Test MCD | Model |
|---|---|---|---|
| tts_transformer | 54M | 3.8 | [Download](https://dl.fbaipublicfiles.com/fairseq/s2/ljspeech_transformer_phn.tar) |
| fastspeech2 | 41M | 3.8 | [Download](https://dl.fbaipublicfiles.com/fairseq/s2/ljspeech_fastspeech2_phn.tar) |
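To try one of the pre-trained models, one could download and unpack the archive and point the inference commands above at the extracted checkpoint and `config.yaml`; the archive contents are an assumption here, so verify the file names after extraction:
```bash
# Download and unpack the pre-trained Transformer model (file names inside
# the archive are assumptions; check them after extraction).
mkdir -p ${SAVE_DIR}
wget -P ${SAVE_DIR} https://dl.fbaipublicfiles.com/fairseq/s2/ljspeech_transformer_phn.tar
tar xf ${SAVE_DIR}/ljspeech_transformer_phn.tar -C ${SAVE_DIR}
ls ${SAVE_DIR}   # locate the checkpoint (.pt) and config.yaml to use for inference
```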
[[Back]](..)