Self-Training with Kaldi HMM Models

This folder contains recipes for self-training on pseudo phone transcripts and decoding into phones or words with Kaldi.

To start, download and install Kaldi following its instructions, and place this folder in path/to/kaldi/egs.

Training

Assuming the following has been prepared:

  • w2v_dir: contains features {train,valid}.{npy,lengths}, real transcripts {train,valid}.${label}, and dict dict.${label}.txt
  • lab_dir: contains pseudo labels {train,valid}.txt
  • arpa_lm: Arpa-format n-gram phone LM for decoding
  • arpa_lm_bin: Arpa-format n-gram phone LM for unsupervised model selection to be used with KenLM

Set these variables in train.sh, along with out_dir (the output directory), and then run it; a sketch of the variable block follows.
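
For illustration, the assignments at the top of train.sh might look like the sketch below. All paths are placeholders, and phn is only an example value for ${label}.

  w2v_dir=/path/to/wav2vec_features   # {train,valid}.{npy,lengths}, {train,valid}.phn, dict.phn.txt
  lab_dir=/path/to/pseudo_labels      # {train,valid}.txt
  arpa_lm=/path/to/phone_lm.arpa      # n-gram phone LM for decoding
  arpa_lm_bin=/path/to/phone_lm.bin   # n-gram phone LM for model selection (read by KenLM)
  out_dir=/path/to/output             # output directory

  bash train.sh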

The output will be:

==== WER w.r.t. real transcript (select based on unsupervised metric)
INFO:root:./out/exp/mono/decode_valid/scoring/14.0.0.tra.txt: score 0.9178 wer 28.71% lm_ppl 24.4500 gt_wer 25.57%
INFO:root:./out/exp/tri1/decode_valid/scoring/17.1.0.tra.txt: score 0.9257 wer 26.99% lm_ppl 30.8494 gt_wer 21.90%
INFO:root:./out/exp/tri2b/decode_valid/scoring/8.0.0.tra.txt: score 0.7506 wer 23.15% lm_ppl 25.5944 gt_wer 15.78%

where wer is the word error rate with respect to the pseudo labels, gt_wer the word error rate with respect to the ground-truth labels, lm_ppl the language model perplexity of the HMM-predicted transcripts, and score the unsupervised metric for model selection. We select the model and LM parameter with the lowest score; in the example above, that is tri2b with 8.0.0.
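
Since selection is by lowest score, if the log above has been saved to a file (train.log here is a hypothetical name), the winning line can be picked with a one-liner such as:

  # sort numerically on the score field (3rd whitespace-separated column) and keep the best
  grep 'score' train.log | sort -k3 -n | head -n1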

Decoding into Phones

In decode_phone.sh, set out_dir to the same value used in train.sh, and set dec_exp and dec_lmparam to the selected model and LM parameter (e.g. tri2b and 8.0.0 in the example above). dec_script must match dec_exp: for mono/tri1/tri2b, use decode.sh; for tri3b, use decode_fmllr.sh.
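
For the running example, the relevant settings in decode_phone.sh would be sketched as:

  out_dir=/path/to/output   # same as in train.sh
  dec_exp=tri2b             # model selected by the unsupervised metric
  dec_lmparam=8.0.0         # LM parameter selected by the unsupervised metric
  dec_script=decode.sh      # use decode_fmllr.sh instead for tri3b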

The output will be saved at out_dir/dec_data.

Decoding into Words

decode_word_step1.sh prepares WFSTs for word decoding. Besides the variables mentioned above, set the following (see the sketch after the list):

  • wrd_arpa_lm: Arpa-format n-gram word LM for decoding
  • wrd_arpa_lm_bin: Arpa-format n-gram word LM for unsupervised model selection
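
A minimal sketch of these settings (placeholder paths):

  wrd_arpa_lm=/path/to/word_lm.arpa      # word LM for decoding
  wrd_arpa_lm_bin=/path/to/word_lm.bin   # word LM for model selection (read by KenLM)

  bash decode_word_step1.sh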

decode_word_step1.sh decodes the train and valid splits into words and runs unsupervised model selection on the valid split. The output looks like:

INFO:root:./out/exp/tri2b/decodeword_valid/scoring/17.0.0.tra.txt: score 1.8693 wer 24.97% lm_ppl 1785.5333 gt_wer 31.45%

After determining the LM parameter (17.0.0 in the example above), set it in decode_word_step2.sh and run it. The output will be saved at out_dir/dec_data_word.
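
As a sketch, assuming decode_word_step2.sh uses the same dec_lmparam variable name as the phone-decoding script:

  dec_lmparam=17.0.0   # LM parameter selected in step 1
  bash decode_word_step2.sh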