MFA-based extraction for FastSpeech
Prepare
Run every command from the root of the repository (TensorFlowTTS/).
Optional: modify the MFA scripts to work with your language (see https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html).
Download the pretrained MFA model and lexicon, then extract the TextGrids:
bash examples/mfa_extraction/scripts/prepare_mfa.sh
python examples/mfa_extraction/run_mfa.py \
  --corpus_directory ./libritts \
  --output_directory ./mfa/parsed \
  --jobs 8
After this step, the TextGrids are located in ./mfa/parsed.
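Before moving on, it can help to confirm that the aligner produced a TextGrid for every utterance. The following is a minimal sketch, not part of the repository: `find_unaligned` is a hypothetical helper, and it assumes each TextGrid has the same file stem as its .wav file, whatever the sub-folder layout is.

```python
from pathlib import Path


def find_unaligned(corpus_dir, textgrid_dir):
    """Return sorted stems of .wav files that have no .TextGrid with the same stem.

    Assumption: MFA names each TextGrid after the audio file it aligned.
    """
    aligned = {p.stem for p in Path(textgrid_dir).rglob("*.TextGrid")}
    wavs = {p.stem for p in Path(corpus_dir).rglob("*.wav")}
    return sorted(wavs - aligned)
```

Calling `find_unaligned("./libritts", "./mfa/parsed")` returns the utterances that were skipped, which you may want to exclude from training.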
- Extract durations from the TextGrid files:
python examples/mfa_extraction/txt_grid_parser.py \
  --yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
  --dataset_path ./libritts \
  --text_grid_path ./mfa/parsed \
  --output_durations_path ./libritts/durations \
  --sample_rate 24000
Dataset structure after finishing this step:
|- TensorFlowTTS/
| |- LibriTTS/
| |- |- train-clean-100/
| |- |- SPEAKERS.txt
| |- |- ...
| |- dataset/
| |- |- 200/
| |- |- |- 200_124139_000001_000000.txt
| |- |- |- 200_124139_000001_000000.wav
| |- |- |- ...
| |- |- 250/
| |- |- ...
| |- |- durations/
| |- |- train.txt
| |- tensorflow_tts/
| |- models/
| |- ...
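To illustrate what duration extraction means here, the sketch below converts TextGrid-style intervals (start and end times in seconds) into per-phoneme frame counts. It is not the code in txt_grid_parser.py: `intervals_to_frames` is a hypothetical name, the hop size of 300 samples is an assumption (the real value comes from the YAML config), and rounding the cumulative boundaries rather than each interval is a choice made here so that rounding errors do not accumulate.

```python
def intervals_to_frames(intervals, sample_rate=24000, hop_size=300):
    """Convert (start_sec, end_sec, label) intervals to frame counts.

    hop_size=300 is an assumed value; read the real one from your config.
    Boundaries are rounded cumulatively, so the counts always sum to the
    rounded total length instead of drifting interval by interval.
    """
    frames_per_second = sample_rate / hop_size
    durations = []
    # Start from the first interval's onset (normally 0.0 in a TextGrid).
    previous = round(intervals[0][0] * frames_per_second) if intervals else 0
    for _start, end, _label in intervals:
        boundary = round(end * frames_per_second)
        durations.append(boundary - previous)
        previous = boundary
    return durations


# Example: three intervals at 24 kHz with a 300-sample hop (80 frames/s).
example = [(0.0, 0.1, "sil"), (0.1, 0.25, "HH"), (0.25, 0.5, "AH")]
print(intervals_to_frames(example))
```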
Optional: add your own dataset parser based on tensorflow_tts/processor/experiment/example_dataset.py (if the base dataset processor does not match your data).
Run preprocessing and normalization (Steps 4 and 5 in examples/fastspeech2_libritts/README.MD).
Run fix_mismatch.py to fix the few-frame differences between the audio and duration files:
python examples/mfa_extraction/fix_mismatch.py \
  --base_path ./dump \
  --trimmed_dur_path ./dataset/trimmed-durations \
  --dur_path ./dataset/durations
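The idea behind fixing a mismatch can be sketched as follows: the durations must sum to the number of feature frames, so any difference is absorbed at the end of the utterance. This is an illustrative sketch under that assumption, not the logic of fix_mismatch.py itself, and `fix_duration_mismatch` is a hypothetical name.

```python
def fix_duration_mismatch(durations, n_frames):
    """Return a copy of durations whose sum equals n_frames.

    Missing frames are added to the last phoneme; surplus frames are
    removed starting from the last phoneme and moving backwards.
    """
    if not durations:
        raise ValueError("durations must not be empty")
    fixed = list(durations)
    diff = n_frames - sum(fixed)
    if diff >= 0:
        fixed[-1] += diff
        return fixed
    excess = -diff
    for i in range(len(fixed) - 1, -1, -1):
        take = min(fixed[i], excess)  # never push a duration below zero
        fixed[i] -= take
        excess -= take
        if excess == 0:
            break
    return fixed
```

For example, durations of 8, 12 and 20 frames against 42 feature frames become 8, 12 and 22, while against 37 frames they become 8, 12 and 17.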
Problems with MFA extraction
MFA seems to have problems with trimmed files. In my experiments it works better with ~100 ms of silence at the start and end of each file.
Short files can produce many false positives, such as alignments that contain only silence (seen with LibriTTS), so I would keep only samples longer than 2 s.
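Both workarounds can be applied before running MFA. The sketch below is one possible way to do it on raw sample sequences; the function names are hypothetical, and the 100 ms and 2 s values are simply the figures mentioned above, not tuned thresholds.

```python
def pad_with_silence(samples, sample_rate, pad_ms=100):
    """Return samples with pad_ms of zero-valued silence at both ends."""
    pad = [0] * int(sample_rate * pad_ms / 1000)
    return pad + list(samples) + pad


def long_enough(samples, sample_rate, min_seconds=2.0):
    """True if the clip is strictly longer than min_seconds."""
    return len(samples) / sample_rate > min_seconds
```

At 24 kHz, `pad_with_silence` adds 2400 zero samples on each side. Apply `long_enough` to the original clip rather than the padded one, so that padding does not let short clips through.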