MFA-based extraction for FastSpeech

Prepare

Run everything from the main repo folder, i.e. TensorFlowTTS/.

Optional*: modify the MFA scripts to work with your language (https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html)
Download the pretrained MFA model and lexicon, then run MFA to extract the TextGrids:

bash examples/mfa_extraction/scripts/prepare_mfa.sh

python examples/mfa_extraction/run_mfa.py \
  --corpus_directory ./libritts \
  --output_directory ./mfa/parsed \
  --jobs 8

After this step, the TextGrids are located in ./mfa/parsed.

Extract durations from the TextGrid files:

python examples/mfa_extraction/txt_grid_parser.py \
  --yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
  --dataset_path ./libritts \
  --text_grid_path ./mfa/parsed \
  --output_durations_path ./libritts/durations \
  --sample_rate 24000
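
To make the duration step more concrete, here is a minimal sketch of how phone intervals from a TextGrid could be turned into per-phone frame counts. It is not the code of txt_grid_parser.py. The function name `intervals_to_durations` is hypothetical, and `hop_size=300` is an assumed value; use the hop size from your yaml config.

```python
import numpy as np


def intervals_to_durations(intervals, sample_rate=24000, hop_size=300):
    """Convert (start_sec, end_sec) phone intervals to frame counts.

    hop_size=300 is an assumption for illustration; take the real value
    from your config file.
    """
    if not intervals:
        return np.zeros(0, dtype=np.int64)
    frames_per_sec = sample_rate / hop_size
    # Round each interval end to a frame boundary, then take differences,
    # so rounding errors do not accumulate across phones.
    ends = np.array([end for _, end in intervals], dtype=np.float64)
    boundaries = np.round(ends * frames_per_sec).astype(np.int64)
    start_frame = int(round(intervals[0][0] * frames_per_sec))
    return np.diff(np.concatenate(([start_frame], boundaries)))


# Three phones covering 0.5 s of audio at 80 frames per second.
durations = intervals_to_durations([(0.0, 0.1), (0.1, 0.25), (0.25, 0.5)])
print(durations, durations.sum())
```

Rounding the boundaries instead of each duration keeps the total close to the number of frames in the utterance.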

Dataset structure after finishing this step:

|- TensorFlowTTS/
|   |- LibriTTS/
|   |-  |- train-clean-100/
|   |-  |- SPEAKERS.txt
|   |-  |- ...
|   |- dataset/
|   |-  |- 200/
|   |-  |-  |- 200_124139_000001_000000.txt
|   |-  |-  |- 200_124139_000001_000000.wav
|   |-  |-  |- ...
|   |-  |- 250/
|   |-  |- ...
|   |-  |- durations/
|   |-  |- train.txt
|   |- tensorflow_tts/
|       |- models/
|       |- ...

Optional*: add your own dataset parser based on tensorflow_tts/processor/experiment/example_dataset.py (if the base dataset processor does not match your dataset)
Run preprocessing and normalization (steps 4 and 5 in examples/fastspeech2_libritts/README.MD)
Run fix_mismatch.py to fix the few-frame differences between the audio and duration files:

python examples/mfa_extraction/fix_mismatch.py \
  --base_path ./dump \
  --trimmed_dur_path ./dataset/trimmed-durations \
  --dur_path ./dataset/durations
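
The idea behind fixing a mismatch can be sketched as follows. This is only an illustration of one possible strategy (pad the last phone, or trim from the end), not necessarily what fix_mismatch.py does. The function name `fix_duration_mismatch` is hypothetical.

```python
import numpy as np


def fix_duration_mismatch(durations, n_frames):
    """Return a copy of durations whose sum equals n_frames.

    Assumed strategy: add missing frames to the last phone, or remove
    extra frames starting from the last phone and moving backwards.
    """
    durations = np.asarray(durations, dtype=np.int64).copy()
    if durations.size == 0:
        raise ValueError("durations must not be empty")
    diff = n_frames - int(durations.sum())
    if diff >= 0:
        durations[-1] += diff  # too short: extend the last phone
        return durations
    remaining = -diff  # too long: trim from the end
    for i in range(len(durations) - 1, -1, -1):
        cut = min(remaining, int(durations[i]))
        durations[i] -= cut
        remaining -= cut
        if remaining == 0:
            break
    return durations


print(fix_duration_mismatch([3, 4, 5], 14))  # two frames are missing
print(fix_duration_mismatch([3, 4, 5], 10))  # two frames too many
```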

Problems with MFA extraction

MFA seems to have problems with trimmed files. In my experiments it works better with ~100 ms of silence at the start and end of each file.
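
If your files are already trimmed, one way to add that silence back is to pad each waveform with zeros before running MFA. A minimal sketch, assuming mono audio loaded as a NumPy array; the function name `pad_silence` is hypothetical and not part of this repo.

```python
import numpy as np


def pad_silence(audio, sample_rate=24000, pad_ms=100):
    """Add pad_ms milliseconds of digital silence to both ends of a mono signal."""
    n_pad = int(sample_rate * pad_ms / 1000)
    silence = np.zeros(n_pad, dtype=audio.dtype)
    return np.concatenate([silence, audio, silence])


audio = np.ones(1000, dtype=np.float32)  # stand-in for a trimmed utterance
padded = pad_silence(audio)
print(len(audio), len(padded))
```

Remember that padding changes the audio length, so the durations must be extracted from the padded files.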

Short files can produce a lot of false positives, such as alignments containing only silence (seen on LibriTTS), so I would keep only samples longer than 2 s.
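
A rough sketch of such a filter, assuming the samples are PCM WAV files that the standard library `wave` module can read. The helper names `wav_duration_seconds` and `keep_long_samples` are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def keep_long_samples(paths, min_seconds=2.0):
    """Keep only the files that are longer than min_seconds."""
    return [p for p in paths if wav_duration_seconds(p) > min_seconds]
```

For example, `keep_long_samples(list_of_wav_paths)` returns the subset of paths worth passing to MFA, where `list_of_wav_paths` is whatever list of files you collected from your dataset folder.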