Segmenting output by phone

#2
by kalbin - opened

Is it possible to segment the output by phones in order to calculate PER instead of CER?

I tried applying Wav2Vec2PhonemeCTCTokenizer (with the current vocab.json) but the results were horrible.

Assuming the label data is already tokenized, we can map each one- or two-character phoneme to a number, then convert that number into a single character using chr(x).
Then we can use the same CER metric. This time it would be computed at the phone level.
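
A minimal sketch of that idea, assuming references and predictions are already lists of phone strings. The function names, the private-use code point offset, and the hand-written edit distance are illustrative assumptions, not part of the model repo; any CER implementation could stand in for the edit distance once the phones are encoded as single characters.

```python
def build_phone_to_char(phone_sequences):
    """Assign every distinct phone one unique character via chr(x)."""
    mapping = {}
    for seq in phone_sequences:
        for phone in seq:
            if phone not in mapping:
                # Assumption: start in the Unicode private-use area so the
                # generated characters never collide with spaces or controls.
                mapping[phone] = chr(0xE000 + len(mapping))
    return mapping


def encode(seq, mapping):
    """Turn a list of phones into a string with one character per phone."""
    return "".join(mapping[p] for p in seq)


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def phone_error_rate(refs, hyps):
    """CER computed over phone-encoded strings, i.e. a phone error rate."""
    mapping = build_phone_to_char(list(refs) + list(hyps))
    errors = sum(edit_distance(encode(r, mapping), encode(h, mapping))
                 for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Two-character phones such as "ae" count as a single unit.
per = phone_error_rate([["k", "ae", "t"]], [["k", "ah", "t"]])
print(per)
```

Because every phone becomes exactly one character, a substitution of a two-character phone counts as one error instead of two.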

Ohh good idea! Would this mean that the model would need to be retrained?

Currently, I'm just using the pipeline approach and taking the text output

@kalbin It doesn't need to be retrained. The model is trained to output a bunch of numbers, like 5, 12, 6, 18. We just need to map each number to a different string.
The pipeline approach also does the whole conversion into a string, so it's slightly difficult to manipulate at a granular level.
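
A rough sketch of what that remapping could look like outside the pipeline, assuming the model is a CTC model and you already have the per-frame predicted ids (for example from an argmax over the logits). The id-to-phone table and the blank id below are made up for illustration and do not reflect the real vocab.json.

```python
def ctc_greedy_decode(frame_ids, id_to_phone, blank_id=0):
    """Collapse repeated frame ids, drop blanks, and map ids to phone strings."""
    phones = []
    previous = None
    for idx in frame_ids:
        # A new token starts only when the id changes and is not the blank.
        if idx != previous and idx != blank_id:
            phones.append(id_to_phone[idx])
        previous = idx
    return phones


# Hypothetical mapping; a real one would be built from the model's vocabulary.
id_to_phone = {5: "k", 12: "a", 6: "t", 18: "s"}
frame_ids = [5, 5, 0, 12, 12, 0, 12, 6, 0, 18]
print(ctc_greedy_decode(frame_ids, id_to_phone))
```

Keeping the output as a list of phones, instead of a joined string, is what makes phone-level scoring straightforward afterwards.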

I was able to get the segmented phones by following Approach 2 and passing output_char_offsets=True to batch_decode
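
For anyone landing here later, a small sketch of post-processing those offsets. It assumes each entry is a dict with "char", "start_offset" and "end_offset" keys, which is the shape the Wav2Vec2 CTC tokenizers return in char_offsets, and that one frame covers 0.02 s (an assumption that holds for a 320-sample stride at 16 kHz; check your own model config). The helper name and the sample data are hypothetical.

```python
def offsets_to_segments(char_offsets, seconds_per_frame=0.02):
    """Convert frame-based char offsets into (phone, start_s, end_s) tuples."""
    segments = []
    for item in char_offsets:
        # Skip word-delimiter entries, which show up as whitespace.
        if item["char"].strip() == "":
            continue
        start = round(item["start_offset"] * seconds_per_frame, 3)
        end = round(item["end_offset"] * seconds_per_frame, 3)
        segments.append((item["char"], start, end))
    return segments


# In practice the list would come from something like
#   outputs = processor.batch_decode(pred_ids, output_char_offsets=True)
#   char_offsets = outputs.char_offsets[0]
# Here it is hand-written sample data for illustration only.
sample = [
    {"char": "k", "start_offset": 10, "end_offset": 12},
    {"char": " ", "start_offset": 12, "end_offset": 13},
    {"char": "a", "start_offset": 15, "end_offset": 18},
]
for phone, start, end in offsets_to_segments(sample):
    print(phone, start, end)
```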

vitouphy changed discussion status to closed
