question on language coverage

by rjrobben - opened Dec 22, 2023

Dec 22, 2023

I am wondering why, for TTS, there is coverage for a less popular language (like Hakka Chinese) but not for much more popular ones (like Mandarin or Cantonese).

This seems counterintuitive to me, since it is much harder to get training data for less popular languages.

Thanks!

rjrobben

Dec 22, 2023

Is it related to the tokenization of the language? I can see that for hak, the vocab.txt is very minimal.

ydshieh

Dec 22, 2023

Hi @rjrobben

Could you elaborate a bit more on what you mean by coverage of a less popular language (like Hakka) compared with much more popular ones (like Mandarin or Cantonese)?
I could not find anything in the model card or in the paper about this.

rjrobben

Dec 22, 2023

Hi @ydshieh , thanks for the reply.

If you look at https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html and search for "hak", you can see that there is TTS support for Hakka Chinese.

But if you search for "mandarin" or "yue", you can see that they have no TTS support.

If you check the list of most spoken languages, you can see that yue and Mandarin are much more popular than hak:
https://en.m.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers

ydshieh

Dec 22, 2023

Thanks a lot! Indeed!

@vineelpratap Do you know why? I see you are the author of many commits in this repository, so I think you would know best.
