Segmenting output by phone
Is it possible to segment the output by phones in order to calculate PER instead of CER?
I tried applying Wav2Vec2PhonemeCTCTokenizer (with the current vocab.json) but the results were horrible.
Assuming the label data is already tokenized, we can map each phoneme (one or two characters, e.g. tʃ) to a number, and convert that number into a single character using chr(x).
Then we can use the same CER computation. Since every phone is now exactly one character, the result is a phone-level error rate (PER).
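A minimal sketch of that idea, assuming the reference and the hypothesis are already lists of phone tokens. The helper names (build_phone_map, encode, phone_error_rate) and the Private Use Area offset are illustrative choices, not something from this thread. The offset only keeps chr(x) away from whitespace and control characters that a CER tool might normalize away.

```python
def build_phone_map(sequences, offset=0xE000):
    """Assign each distinct phone a unique single character.

    The offset (Unicode Private Use Area) is an assumption made here so that
    no phone is mapped to whitespace or a control character.
    """
    phone_to_char = {}
    for seq in sequences:
        for phone in seq:
            if phone not in phone_to_char:
                phone_to_char[phone] = chr(offset + len(phone_to_char))
    return phone_to_char


def encode(seq, phone_to_char):
    """Turn a list of phones into a string with one character per phone."""
    return "".join(phone_to_char[p] for p in seq)


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phone_error_rate(reference, hypothesis):
    """CER computed on the encoded strings, which is PER on the phones."""
    mapping = build_phone_map([reference, hypothesis])
    ref = encode(reference, mapping)
    hyp = encode(hypothesis, mapping)
    return edit_distance(ref, hyp) / len(ref)


reference = ["tʃ", "ɪ", "p"]
hypothesis = ["ʃ", "ɪ", "p"]
per = phone_error_rate(reference, hypothesis)
```

The encoded strings can also be passed to whatever CER implementation is already in use, as long as both sides are encoded with the same mapping.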
Ohh good idea! Would this mean that the model would need to be retrained?
Currently, I'm just using the pipeline approach and taking the text output.
@kalbin
It doesn't need to be retrained. The model is trained to output a bunch of numbers, like 5, 12, 6, 18. We just need to map each number to a different string.
The pipeline approach also does the whole conversion into a string, which makes it slightly difficult to manipulate at a granular level.
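To illustrate the mapping step, here is a rough sketch that assumes the model is a CTC model and that you have the per-frame argmax ids. The id-to-phone table and the blank id below are made up for the example; the real ones would come from the model's vocab.json.

```python
from itertools import groupby


def ctc_ids_to_phones(ids, id_to_phone, blank_id=0):
    """Greedy CTC decoding: collapse repeated ids, then drop the blank."""
    collapsed = [key for key, _ in groupby(ids)]
    return [id_to_phone[i] for i in collapsed if i != blank_id]


# Hypothetical vocabulary and frame-level ids, for illustration only.
id_to_phone = {5: "tʃ", 12: "ɪ", 6: "p", 18: "s"}
frame_ids = [5, 5, 0, 12, 12, 0, 6, 18, 18]
phones = ctc_ids_to_phones(frame_ids, id_to_phone)
```

Keeping the output as a list of phones, rather than a joined string, avoids having to re-split multi-character phones such as tʃ later.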
I was able to get the segmented phones by following Approach 2 and passing output_char_offsets=True to batch_decode.
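For anyone following along, a small sketch of post-processing those offsets. It assumes each entry is a dict with "char", "start_offset" and "end_offset" keys, which is my understanding of what the Wav2Vec2 CTC tokenizer returns, with offsets counted in model frames. The sample data and the 0.02 seconds-per-frame value are assumptions for illustration; check them against your own model and sampling rate.

```python
def phones_with_times(char_offsets, seconds_per_frame=0.02):
    """Return (phone, start_sec, end_sec) tuples, skipping whitespace tokens."""
    segments = []
    for item in char_offsets:
        if item["char"].strip() == "":
            continue  # word delimiter or padding, not a phone
        start = round(item["start_offset"] * seconds_per_frame, 3)
        end = round(item["end_offset"] * seconds_per_frame, 3)
        segments.append((item["char"], start, end))
    return segments


# Hypothetical offsets for one utterance, in the assumed shape. In practice
# they would come from something like:
#   outputs = processor.batch_decode(pred_ids, output_char_offsets=True)
#   char_offsets = outputs.char_offsets[0]
char_offsets = [
    {"char": "tʃ", "start_offset": 10, "end_offset": 12},
    {"char": " ", "start_offset": 12, "end_offset": 13},
    {"char": "ɪ", "start_offset": 13, "end_offset": 15},
]
segments = phones_with_times(char_offsets)
phone_list = [phone for phone, _, _ in segments]
```

The resulting phone_list is already segmented, so it can be scored for PER directly without re-splitting a string.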