Non-English languages not working

#6
by diwank - opened

Prompt:
Translate to English. क्या तुम जादूगर हो? (Hindi for "Are you a magician?")

Output:
?

Try providing the source language, e.g. "Translate Spanish to English:"
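
For example, a quick sketch of that prompt format (reusing the tokenizer and model from the snippet further down in this thread; the Spanish sentence is just my own test input):

input_ids = tokenizer("Translate Spanish to English: ¿Eres un mago?", return_tensors="pt").input_ids.to("cuda")
print(tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True))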

Letter "ñ" in spanish is not appearing. I believe is a problem with the Tokenizer

Can confirm this does not work for Japanese or any other language beyond those T5 supports.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # imports added; the checkpoint name below is an assumption (this card's Flan-T5 size)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl").to("cuda")
input_text = "translate English to Japanese: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)

input_ids: tensor([[13959, 1566, 12, 4318, 10, 571, 625, 33, 25, 58, 1]], device='cuda:0')
output: tensor([[ 0, 3, 2, 3, 58, 1]], device='cuda:0')
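
Decoding those ids makes the failure visible (a sketch; the id-to-token mapping is my reading of the standard T5 vocab, where ids 0/1/2 are <pad>/</s>/<unk>):

print(tokenizer.convert_ids_to_tokens(outputs[0].tolist()))
# expected: ['<pad>', '▁', '<unk>', '▁', '?', '</s>'] -- the Japanese answer collapsed to <unk>
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # leaves roughly just "?"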

I think the tokenizer encodes the input fine, but the model output of 0, 3, 2, 3 doesn't seem right, so that is probably why the decoded output shows up empty.
Maybe we need an mT5 version of this model to support other languages.
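
For comparison, mT5's tokenizer was trained on 101 languages, so Japanese survives the round trip (a sketch using the public google/mt5-base checkpoint; the Japanese sentence means "How old are you?"):

from transformers import AutoTokenizer
mt5_tok = AutoTokenizer.from_pretrained("google/mt5-base")
print(mt5_tok.convert_ids_to_tokens(mt5_tok("あなたは何歳ですか？").input_ids))  # real subword pieces, no <unk>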

This part is confusing. I suggest revising this part of the model card:

Language(s) (NLP): English, Spanish, Japanese, Persian, Hindi, French, Chinese, Bengali, Gujarati, German, Telugu, Italian, Arabic, Polish, Tamil, Marathi, Malayalam, Oriya, Panjabi, Portuguese, Urdu, Galician, Hebrew, Korean, Catalan, Thai, Dutch, Indonesian, Vietnamese, Bulgarian, Filipino, Central Khmer, Lao, Turkish, Russian, Croatian, Swedish, Yoruba, Kurdish, Burmese, Malay, Czech, Finnish, Somali, Tagalog, Swahili, Sinhala, Kannada, Zhuang, Igbo, Xhosa, Romanian, Haitian, Estonian, Slovak, Lithuanian, Greek, Nepali, Assamese, Norwegian

@leli You are correct, Flan-T5 is based on T5 and its tokenizer is English only. I did not write this model card, but it should just say English. However, Flan-PaLM, or other base models that are multilingual and are finetuned on the Flan Collection, should have multilingual abilities.

@Shayne have you tried Flan-PaLM? It's pretty huge.
