# MFA-based extraction for FastSpeech
## Prepare
Run every command from the main repository folder, `TensorFlowTTS/`.
0. Optional: modify the MFA scripts to work with your language (https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html)
1. Download the pretrained MFA model and lexicon, then run the alignment to extract TextGrids:
- ```
bash examples/mfa_extraction/scripts/prepare_mfa.sh
```
- ```
python examples/mfa_extraction/run_mfa.py \
--corpus_directory ./libritts \
--output_directory ./mfa/parsed \
--jobs 8
```
After this step, the TextGrid files are located in `./mfa/parsed`.
2. Extract durations from the TextGrid files:
- ```
python examples/mfa_extraction/txt_grid_parser.py \
--yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
--dataset_path ./libritts \
--text_grid_path ./mfa/parsed \
--output_durations_path ./libritts/durations \
--sample_rate 24000
```
- Dataset structure after finishing this step:
```
|- TensorFlowTTS/
| |- LibriTTS/
| |- |- train-clean-100/
| |- |- SPEAKERS.txt
| |- |- ...
| |- dataset/
| |- |- 200/
| |- |- |- 200_124139_000001_000000.txt
| |- |- |- 200_124139_000001_000000.wav
| |- |- |- ...
| |- |- 250/
| |- |- ...
| |- |- durations/
| |- |- train.txt
| |- tensorflow_tts/
| |- models/
| |- ...
```
3. Optional: add your own dataset parser based on `tensorflow_tts/processor/experiment/example_dataset.py` (if the base dataset processor does not match your data).
4. Run preprocessing and normalization (steps 4 and 5 in `examples/fastspeech2_libritts/README.MD`).
5. Run `fix_mismatch.py` to fix differences of a few frames between the audio and duration files:
- ```
python examples/mfa_extraction/fix_mismatch.py \
--base_path ./dump \
--trimmed_dur_path ./dataset/trimmed-durations \
--dur_path ./dataset/durations
```
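
To make steps 2 and 5 more concrete, here is a minimal sketch of the underlying idea. It is not the code of `txt_grid_parser.py` or `fix_mismatch.py`. The function names `intervals_to_frames` and `match_length` are hypothetical, the hop size of 300 samples is an assumption (take the real value from your YAML config), and the phone intervals are assumed to be contiguous.

```python
def intervals_to_frames(intervals, sample_rate=24000, hop_size=300):
    """Convert contiguous (start_sec, end_sec) phone intervals to frame counts.

    hop_size=300 is an assumed value; read the real one from your YAML config.
    """
    frames_per_second = sample_rate / hop_size
    # Round the interval boundaries (not the lengths) to frame indices,
    # so rounding errors do not accumulate over the utterance.
    bounds = [intervals[0][0]] + [end for _, end in intervals]
    frame_idx = [round(t * frames_per_second) for t in bounds]
    return [b - a for a, b in zip(frame_idx, frame_idx[1:])]


def match_length(durations, n_frames):
    """Adjust durations so that they sum to n_frames (the number of mel frames)."""
    durations = list(durations)
    diff = n_frames - sum(durations)
    if diff >= 0:
        # Too few duration frames: extend the last phone.
        durations[-1] += diff
    else:
        # Too many duration frames: remove the surplus from the end,
        # never letting a single duration drop below zero.
        surplus = -diff
        for i in range(len(durations) - 1, -1, -1):
            take = min(surplus, durations[i])
            durations[i] -= take
            surplus -= take
            if surplus == 0:
                break
    return durations


if __name__ == "__main__":
    durs = intervals_to_frames([(0.0, 0.1), (0.1, 0.25), (0.25, 0.5)])
    print(durs, match_length(durs, 42))
```

The point of the second function is that the sum of the durations has to equal the number of feature frames of the same utterance, otherwise training on that sample will fail or be misaligned.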
## Problems with MFA extraction
MFA seems to have problems with trimmed files. In my experiments it works better with ~100 ms of silence at the start and end of each file.
Short files can produce many false positives, such as alignments that contain only silence (seen with LibriTTS), so I would keep only samples longer than 2 s.
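
Both workarounds can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the helper names `wav_duration_seconds`, `is_long_enough` and `pad_with_silence` are hypothetical and not part of this repository, and the defaults (24 kHz, 100 ms, 2 s) simply mirror the numbers mentioned above.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(num_samples, sample_rate=24000, min_seconds=2.0):
    """Keep only clips strictly longer than min_seconds."""
    return num_samples / sample_rate > min_seconds


def pad_with_silence(samples, sample_rate=24000, pad_ms=100):
    """Add pad_ms of digital silence (zeros) before and after the samples."""
    pad = [0] * int(sample_rate * pad_ms / 1000)
    return pad + list(samples) + pad


if __name__ == "__main__":
    clip = [1, 2, 3]
    padded = pad_with_silence(clip)
    print(len(padded), is_long_enough(len(padded)))
```

If you pad the audio before alignment, remember that the extracted durations then refer to the padded files, so the same padded audio should be used for feature extraction.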