
MFA based extraction for FastSpeech

Prepare

Run every command from the main repo folder, i.e. TensorFlowTTS/.

  1. Optional* Modify MFA scripts to work with your language (https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html)

  2. Download the pretrained MFA and lexicon, then run the TextGrid extraction:

  • bash examples/mfa_extraction/scripts/prepare_mfa.sh
    
  • python examples/mfa_extraction/run_mfa.py \
      --corpus_directory ./libritts \
      --output_directory ./mfa/parsed \
      --jobs 8
    

    After this step, the TextGrids are located in ./mfa/parsed.

  3. Extract durations from the TextGrid files:
  • python examples/mfa_extraction/txt_grid_parser.py \
      --yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
      --dataset_path ./libritts \
      --text_grid_path ./mfa/parsed \
      --output_durations_path ./libritts/durations \
      --sample_rate 24000 
    
  • Dataset structure after finishing this step:

    |- TensorFlowTTS/
    |   |- LibriTTS/
    |   |-  |- train-clean-100/
    |   |-  |- SPEAKERS.txt
    |   |-  |- ...
    |   |- dataset/
    |   |-  |- 200/
    |   |-  |-  |- 200_124139_000001_000000.txt
    |   |-  |-  |- 200_124139_000001_000000.wav
    |   |-  |-  |- ...
    |   |-  |- 250/
    |   |-  |- ...
    |   |-  |- durations/
    |   |-  |- train.txt
    |   |- tensorflow_tts/
    |       |- models/
    |       |- ...
    
  4. Optional* Add your own dataset parser based on tensorflow_tts/processor/experiment/example_dataset.py (if the base dataset processor doesn't match your data)

  5. Run preprocessing and normalization (steps 4 and 5 in examples/fastspeech2_libritts/README.MD)

  6. Run fix_mismatch.py to correct differences of a few frames between the audio features and the duration files:

  • python examples/mfa_extraction/fix_mismatch.py \
      --base_path ./dump \
      --trimmed_dur_path ./dataset/trimmed-durations \
      --dur_path ./dataset/durations
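
To make the duration and mismatch steps above more concrete, here is a minimal sketch of the idea. It is not the code in txt_grid_parser.py or fix_mismatch.py. The function names intervals_to_frames and fix_length are hypothetical, the hop size of 300 samples is an assumption (check the value in your YAML config), and the intervals are assumed to be contiguous (start, end) pairs in seconds.

```python
import numpy as np


def intervals_to_frames(intervals, sample_rate=24000, hop_size=300):
    """Convert contiguous (start_sec, end_sec) intervals to integer frame counts.

    hop_size=300 is an assumed value; use the one from your config.
    """
    # Round the cumulative boundaries instead of each duration,
    # so rounding errors do not accumulate over the utterance.
    bounds = [intervals[0][0]] + [end for _, end in intervals]
    frame_bounds = np.round(np.asarray(bounds) * sample_rate / hop_size)
    return np.diff(frame_bounds.astype(np.int64))


def fix_length(durations, n_frames):
    """Adjust durations so that they sum exactly to n_frames.

    Missing frames are added to the last entry; surplus frames are
    removed starting from the end of the sequence.
    """
    durations = np.array(durations, dtype=np.int64)
    diff = n_frames - int(durations.sum())
    if diff >= 0:
        durations[-1] += diff
        return durations
    excess = -diff
    for i in range(len(durations) - 1, -1, -1):
        take = min(excess, int(durations[i]))
        durations[i] -= take
        excess -= take
        if excess == 0:
            break
    return durations


# Hypothetical three-phoneme alignment, times in seconds.
example = [(0.0, 0.1), (0.1, 0.25), (0.25, 0.5)]
frames = intervals_to_frames(example)
fixed = fix_length(frames, n_frames=42)
```

Whatever strategy is used, the point of the fix is that the durations of an utterance sum to exactly the number of feature frames, which is what FastSpeech-style training expects.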
    

Problems with MFA extraction

MFA appears to have problems with trimmed files; in my experiments it works better with ~100 ms of silence at the start and end of each file.
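
One way to add that silence is to zero-pad each waveform before alignment. The sketch below assumes mono audio already loaded as a 1-D NumPy array; pad_silence is a hypothetical helper, not part of the repo.

```python
import numpy as np


def pad_silence(audio, sample_rate, pad_ms=100):
    """Add pad_ms milliseconds of digital silence to both ends of a mono signal."""
    n_pad = int(sample_rate * pad_ms / 1000)
    return np.pad(audio, (n_pad, n_pad), mode="constant")


# One second of dummy audio at 24 kHz, padded by 100 ms on each side.
audio = np.ones(24000, dtype=np.float32)
padded = pad_silence(audio, sample_rate=24000)
```

If you pad before alignment, keep the same padded audio for feature extraction, otherwise the durations and features will no longer line up.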

Short files can produce many false positives, such as alignments that contain only silence (seen with LibriTTS), so I would keep only samples longer than 2 s.
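
A simple length filter along these lines could look like the sketch below. It uses only the standard-library wave module, so it assumes uncompressed PCM WAV files; wav_duration_seconds and keep_long_enough are hypothetical helper names.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def keep_long_enough(wav_paths, min_seconds=2.0):
    """Keep only the files strictly longer than min_seconds."""
    return [p for p in wav_paths if wav_duration_seconds(p) > min_seconds]
```

The 2 s threshold is the rule of thumb stated above, not a hard limit; adjust min_seconds to your data.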