How to prepare the dataset for training
- Download the Ukrainian dataset from https://github.com/egorsmkv/speech-recognition-uk.
- Delete the Common Voice folder from the dataset.
- Download import_ukrainian.py and put it into the DeepSpeech/bin folder.
- Run the import script.
- Download the Common Voice 6.1 Ukrainian dataset.
- Convert it to the DeepSpeech format.
- Merge train.csv from the dataset and train.csv from the converted Common Voice data into one file.
- Put the CV files into the dataset files folder.
- Put dev.csv and test.csv into that folder.
Note: you can also specify several datasets as a comma-separated list, e.g. dataset1/train.csv,dataset2/train.csv.
You have a reproducible dataset!
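The merge step can be sketched in Python. This is a minimal sketch, not part of the original workflow: it assumes both files use the usual DeepSpeech CSV columns (wav_filename, wav_filesize, transcript), and the function name merge_csvs is hypothetical. It concatenates the rows and writes the header only once, which a plain file concatenation would not do.

```python
import csv

# Column layout assumed for DeepSpeech training CSVs.
FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def merge_csvs(inputs, output):
    """Concatenate DeepSpeech-style CSV files into one file with a single header.

    Returns the number of data rows written.
    """
    rows_written = 0
    with open(output, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for path in inputs:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    # Keep only the three expected columns, in a fixed order.
                    writer.writerow({name: row[name] for name in FIELDS})
                    rows_written += 1
    return rows_written
```

For example, `merge_csvs(["dataset1/train.csv", "dataset2/train.csv"], "train_all.csv")` would produce one combined file; the paths here are placeholders.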
Scorer
Refer to the DeepSpeech guide for further explanations.
Generate the scorer package:
python3 generate_lm.py --input_txt ../../../voice-recognition-ua/data/all_text.txt --output_dir . \
--top_k 500000 --kenlm_bins ../../../voice-recognition-ua/kenlm/build/bin \
--arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" \
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie
- Run lm_optimizer to find the best scorer values (alpha and beta).
- Rerun the scorer generation step with those values to build the new scorer.
Caution: the scorer is very model-dependent, so you'll likely need to adjust it for each model.
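To illustrate why the tuning has to be repeated per model, here is a minimal sketch of the kind of search the lm_optimizer step performs: try candidate pairs of the language-model weight (alpha) and the word-insertion weight (beta), score each pair on a validation set, and keep the best. This is a plain random search written for illustration, not DeepSpeech's own implementation; the evaluate callback, the function name, and the default ranges are all assumptions.

```python
import random


def search_alpha_beta(evaluate, n_trials=20, max_alpha=5.0, max_beta=5.0, seed=0):
    """Random search over (alpha, beta) that minimises an error score.

    `evaluate(alpha, beta)` is a hypothetical callback: it should decode a
    validation set with the given weights and return an error rate such as
    WER (lower is better).
    Returns a tuple (best_error, best_alpha, best_beta).
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha = rng.uniform(0.0, max_alpha)
        beta = rng.uniform(0.0, max_beta)
        error = evaluate(alpha, beta)
        if best is None or error < best[0]:
            best = (error, alpha, beta)
    return best
```

Because the error depends on the acoustic model that produces the hypotheses, the best pair found for one model is not guaranteed to carry over to another.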