CommonVoice transcription should not have been used directly

#1
by feddybear - opened

The transcriptions in CommonVoice for nan-tw combine all scripts in this format:
Hanlo/Hanzi ( Tailo | pronunciation variants in Tailo or POJ )

I don't think it's a good idea to fine-tune on these labels directly. Each label writes the same utterance out more than once, in different scripts, which makes it hard for the model to associate acoustic patterns with the order of the text.

Either choose a single script, or put similar sounds together by interleaving Hanzi and Tailo syllable by syllable (though even the latter is suboptimal unless the tokenizer is retrained).
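
For the single-script option, here is a rough sketch of how the labels could be split before fine-tuning. It assumes every label follows the layout described above exactly, with ASCII parentheses and `|` as the variant separator; I have not checked this against the whole dataset. The names `split_label` and `pick_script` are made up for illustration, and labels that don't match are passed through unchanged.

```python
import re

# Assumed label layout (from the format described above, not verified on the data):
#   <Hanlo/Hanzi> ( <Tailo> | <variant> | ... )
_LABEL = re.compile(
    r"^\s*(?P<hanzi>[^()]*?)\s*\(\s*(?P<rest>[^()]*?)\s*\)\s*$"
)


def split_label(label):
    """Split a combined label into its script parts, or return None if it does not match."""
    m = _LABEL.match(label)
    if m is None:
        return None
    # The first "|"-separated field is taken to be the main Tailo reading,
    # everything after it is treated as a pronunciation variant.
    parts = [p.strip() for p in m.group("rest").split("|")]
    return {
        "hanzi": m.group("hanzi"),
        "tailo": parts[0],
        "variants": [p for p in parts[1:] if p],
    }


def pick_script(label, script="tailo"):
    """Return only one script ("hanzi" or "tailo") to use as the training target."""
    fields = split_label(label)
    if fields is None:
        # Label is not in the assumed layout: leave it as it is.
        return label.strip()
    return fields[script]
```

Applying `pick_script` to the transcription column before tokenization would give the model one consistent script per utterance. Interleaving per syllable would additionally need a syllable-level alignment between the Hanzi and Tailo parts, which this sketch does not attempt.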
