CommonVoice transcription should not have been used directly
#1 opened by feddybear
The transcriptions in CommonVoice for nan-tw combine all scripts in this format:
Hanlo/Hanzi ( Tailo | pronunciation variants in Tailo or POJ )
I don't think it's a good idea to fine-tune on these labels directly: each label repeats the same utterance in several scripts one after another, so it is hard for the model to associate acoustic patterns with the order of the text.
Either choose a single script, or put similar sounds together by interleaving Hanzi and Tailo per syllable (though even this latter option is suboptimal if the tokenizer is not retrained).
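For the "choose only one script" option, here is a minimal sketch of how the combined label could be split before fine-tuning. It assumes the labels follow the format described above exactly (one parenthesized group at the end, variants separated by `|`); the function name `split_label` and the sample sentence are made up for illustration and are not taken from the dataset.

```python
import re

# Assumed label layout, based only on the format described in this post:
#   "<Hanlo/Hanzi> ( <Tailo> | <variant> | ... )"
# Both ASCII and full-width parentheses are accepted, as a precaution.
_LABEL_RE = re.compile(
    r"^\s*(?P<hanzi>.*?)\s*[(（]\s*(?P<romanized>.*?)\s*[)）]\s*$"
)


def split_label(label: str) -> dict:
    """Split a combined label into its Hanzi part, Tailo part, and variants."""
    match = _LABEL_RE.match(label)
    if match is None:
        # No trailing parenthesized group: keep the text as a single script.
        return {"hanzi": label.strip(), "tailo": None, "variants": []}

    # The first "|"-separated field is taken to be the main Tailo reading,
    # the remaining fields are treated as pronunciation variants.
    parts = [p.strip() for p in match.group("romanized").split("|")]
    parts = [p for p in parts if p]
    return {
        "hanzi": match.group("hanzi"),
        "tailo": parts[0] if parts else None,
        "variants": parts[1:],
    }


if __name__ == "__main__":
    # Hypothetical label, written without tone marks for simplicity.
    example = "食飽未 ( tsiah-pa bue | chiah-pa boe )"
    fields = split_label(example)
    print(fields["hanzi"])  # use this as the target for a Hanzi-only model
    print(fields["tailo"])  # or this for a Tailo-only model
```

Training on only `hanzi` or only `tailo` gives the model one target sequence per utterance, which avoids the ordering problem. Labels that don't match the assumed layout are passed through unchanged, so it is worth checking how many of those there are before relying on this.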