Segmenting output by phone
Is it possible to segment the output by phones in order to calculate PER instead of CER?
I tried applying Wav2Vec2PhonemeCTCTokenizer (with the current vocab.json) but the results were horrible.
Assuming the label data is already tokenized into phonemes, we can map each phoneme, including two-character ones such as tʃ,
to a number, and convert that number into a single character using chr(x).
Then we can reuse the same CER computation. This time it operates at the phone level.
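To make the idea concrete, here is a minimal sketch in plain Python. The helper names (phones_to_chars, edit_distance, phone_error_rate) are made up for illustration and do not come from any library. It assumes both the reference and the hypothesis are already lists of phone strings. Using the Unicode private-use area as the starting code point is my own choice, so the generated characters cannot collide with control characters or real text.

```python
def phones_to_chars(phones, table):
    """Encode each phone as one unique character, extending `table` as needed."""
    encoded = []
    for phone in phones:
        if phone not in table:
            # 0xE000 is the start of the Unicode private-use area (assumption:
            # any range of distinct code points would work equally well).
            table[phone] = chr(0xE000 + len(table))
        encoded.append(table[phone])
    return "".join(encoded)


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def phone_error_rate(ref_phones, hyp_phones):
    """CER over the one-character-per-phone encoding, i.e. a phone-level rate."""
    if not ref_phones:
        raise ValueError("reference must contain at least one phone")
    table = {}  # shared so the same phone gets the same character in both strings
    ref = phones_to_chars(ref_phones, table)
    hyp = phones_to_chars(hyp_phones, table)
    return edit_distance(ref, hyp) / len(ref)


# One substituted phone out of three: tʃ was recognized as t.
per = phone_error_rate(["tʃ", "iː", "z"], ["t", "iː", "z"])
```

With the encoded strings in hand, any existing CER implementation should give the same number, since each character now stands for exactly one phone.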
Ohh good idea! Would this mean that the model would need to be retrained?
Currently, I'm just using the pipeline approach and taking the text output.
It doesn't need to be retrained. The model is trained to output a sequence of numbers (token IDs), like 5, 12, 6, 18. We just need to map each number to a different string.
The pipeline approach also does the whole conversion into a string for you, so it's slightly difficult to manipulate the output at a granular level.
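As a rough illustration of mapping the IDs yourself, here is a sketch of the usual greedy CTC collapse (merge repeated IDs, drop the blank, look up each remaining ID). The function name, the id_to_phone table, and blank_id=0 are hypothetical. In practice the IDs would come from the argmax over the model's logits, and the table and blank ID from the model's own vocab.json.

```python
def ctc_ids_to_phones(ids, id_to_phone, blank_id=0):
    """Collapse a frame-level ID sequence into a list of phone strings."""
    phones = []
    prev = None
    for token_id in ids:
        # Emit a phone only when the ID changes and is not the CTC blank.
        if token_id != prev and token_id != blank_id:
            phones.append(id_to_phone[token_id])
        prev = token_id
    return phones


# Hypothetical vocabulary and frame-level prediction, for illustration only.
id_to_phone = {5: "tʃ", 12: "iː", 6: "z"}
frame_ids = [5, 5, 0, 12, 12, 0, 0, 6]
phones = ctc_ids_to_phones(frame_ids, id_to_phone)
```

Because the output stays a list, a two-character phone like tʃ remains one unit instead of being split when the string is later iterated character by character.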
I was able to get the segmented phones by following Approach 2 and passing output_char_offsets=True to batch_decode
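For anyone landing here later, a small sketch of turning those offsets into a phone list with timestamps. It assumes each entry is a dict with "char", "start_offset" and "end_offset" keys, and that one output frame covers 0.02 s (320 input samples at 16 kHz), which should be checked against the model config. The sample offsets below are made up, and offsets_to_segments is a hypothetical helper name.

```python
def offsets_to_segments(char_offsets, seconds_per_frame=0.02):
    """Convert frame offsets into phone segments with start/end times in seconds."""
    segments = []
    for entry in char_offsets:
        if not entry["char"].strip():
            continue  # skip whitespace-only entries such as word delimiters
        segments.append({
            "phone": entry["char"],
            "start": round(entry["start_offset"] * seconds_per_frame, 3),
            "end": round(entry["end_offset"] * seconds_per_frame, 3),
        })
    return segments


# Made-up offsets in the assumed shape, for illustration only.
sample_offsets = [
    {"char": "tʃ", "start_offset": 1, "end_offset": 3},
    {"char": " ", "start_offset": 3, "end_offset": 4},
    {"char": "iː", "start_offset": 4, "end_offset": 7},
]
segments = offsets_to_segments(sample_offsets)
phones = [s["phone"] for s in segments]
```

The resulting phones list can be compared directly against the reference phones to get PER.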