# MFA based extraction for FastSpeech

## Prepare

Everything is run from the main repo folder, i.e. `TensorFlowTTS/`.

0. (Optional) Modify the MFA scripts to work with your language (https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html).

1. Download the pretrained MFA model and lexicon, then extract the TextGrids:

   ```
   bash examples/mfa_extraction/scripts/prepare_mfa.sh
   ```

   ```
   python examples/mfa_extraction/run_mfa.py \
     --corpus_directory ./libritts \
     --output_directory ./mfa/parsed \
     --jobs 8
   ```

   After this step, the TextGrids are located at `./mfa/parsed`.

2. Extract durations from the TextGrid files (a rough sketch of the underlying time-to-frame conversion is given at the end of this document):

   ```
   python examples/mfa_extraction/txt_grid_parser.py \
     --yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
     --dataset_path ./libritts \
     --text_grid_path ./mfa/parsed \
     --output_durations_path ./libritts/durations \
     --sample_rate 24000
   ```

   Dataset structure after finishing this step:

   ```
   |- TensorFlowTTS/
   |  |- LibriTTS/
   |  |- |- train-clean-100/
   |  |- |- SPEAKERS.txt
   |  |- |- ...
   |  |- dataset/
   |  |- |- 200/
   |  |- |- |- 200_124139_000001_000000.txt
   |  |- |- |- 200_124139_000001_000000.wav
   |  |- |- |- ...
   |  |- |- 250/
   |  |- |- ...
   |  |- |- durations/
   |  |- |- train.txt
   |  |- tensorflow_tts/
   |  |- models/
   |  |- ...
   ```

3. (Optional) Add your own dataset parser based on `tensorflow_tts/processor/experiment/example_dataset.py` (if the base processor does not match your dataset).

4. Run preprocessing and normalization (steps 4 and 5 in `examples/fastspeech2_libritts/README.MD`).

5. Run the mismatch fix to correct the few-frame differences between the audio and duration files:

   ```
   python examples/mfa_extraction/fix_mismatch.py \
     --base_path ./dump \
     --trimmed_dur_path ./dataset/trimmed-durations \
     --dur_path ./dataset/durations
   ```

## Problems with MFA extraction

MFA seems to have problems with tightly trimmed files; in my experiments it works better with ~100 ms of silence at the start and end. Short files can also produce many false positives such as silence-only alignments (as seen on LibriTTS), so I would keep only samples longer than 2 s. A sketch of such a filtering/padding pass is shown below.
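The following is a minimal sketch of the pre-MFA pass suggested above: dropping clips shorter than 2 s and padding ~100 ms of silence at both ends before running `run_mfa.py`. It assumes mono wavs with matching `.txt` transcripts; the paths, thresholds, and the `soundfile`/`numpy` dependencies are illustrative choices, not part of the extraction scripts.

```python
# Hypothetical pre-MFA pass: skip clips shorter than 2 s and pad ~100 ms of
# silence at both ends. Paths and thresholds are illustrative only.
import shutil
from pathlib import Path

import numpy as np
import soundfile as sf

MIN_DURATION_S = 2.0   # very short clips tend to get silence-only alignments
PAD_S = 0.1            # ~100 ms of silence prepended and appended

src_dir = Path("./libritts")         # corpus with .wav + .txt pairs
dst_dir = Path("./libritts_padded")  # corpus to point run_mfa.py at instead

for wav_path in src_dir.rglob("*.wav"):
    audio, sr = sf.read(wav_path)    # assumes mono audio (as in LibriTTS)
    if len(audio) / sr < MIN_DURATION_S:
        continue                     # too short, likely to misalign
    pad = np.zeros(int(PAD_S * sr), dtype=audio.dtype)
    padded = np.concatenate([pad, audio, pad])

    out_path = dst_dir / wav_path.relative_to(src_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sf.write(out_path, padded, sr)

    # keep the transcript next to the padded wav so MFA can still find it
    txt_path = wav_path.with_suffix(".txt")
    if txt_path.exists():
        shutil.copy(txt_path, out_path.with_suffix(".txt"))
```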
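## How durations are derived from TextGrids (sketch)

For reference, step 2 boils down to converting each aligned phone interval in a TextGrid into a count of feature frames. The snippet below is a rough illustration of that idea, not the exact code in `txt_grid_parser.py`; `hop_size = 300` is an assumption matching a 24 kHz LibriTTS setup, and the real value should be read from `fastspeech2libritts.yaml`.

```python
# Rough illustration of turning an aligned phone interval into a frame count.
sample_rate = 24000  # --sample_rate passed to txt_grid_parser.py
hop_size = 300       # assumption: frame hop in samples from the yaml config

def interval_to_frames(start_s: float, end_s: float) -> int:
    """Number of feature frames covered by a phone aligned from start_s to end_s."""
    return int(round((end_s - start_s) * sample_rate / hop_size))

# A phone aligned from 0.10 s to 0.25 s lasts 0.15 s -> 12 frames at this hop size.
print(interval_to_frames(0.10, 0.25))  # 12
```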