Support Japanese?

#7
by kosukekurimoto - opened

I tried some prompts in Japanese.

However, for questions that include Japanese, an empty string is returned.

-translate
English -> Japanese

  • Question
    ...

There are no CJK characters in tokenizer.json.

This looks different from what is described in the paper.

It seems that the public Flan-T5 does not support CJK; only the Flan-PaLM mentioned in the paper, which has not been publicly released, does.
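For anyone who wants to reproduce the tokenizer.json observation above, here is a minimal sketch that scans a tokenizer vocabulary for CJK characters. It assumes the file has a `model.vocab` entry that is either a list of `[token, score]` pairs (the layout typically used for Unigram tokenizers) or a token-to-id mapping. The helper names (`is_cjk`, `vocab_tokens`, `cjk_tokens`) and the Unicode ranges chosen are illustrative assumptions, not part of any library.

```python
import json

# Assumed set of Unicode ranges to treat as "CJK" for this check;
# it is not exhaustive (for example, rare extension blocks are omitted).
CJK_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def is_cjk(ch):
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def vocab_tokens(tokenizer_json):
    """Extract token strings from a parsed tokenizer.json dict.

    Assumes model.vocab is either a list of [token, score] pairs
    or a dict mapping token -> id.
    """
    vocab = tokenizer_json["model"]["vocab"]
    if isinstance(vocab, dict):
        return list(vocab.keys())
    return [entry[0] for entry in vocab]


def cjk_tokens(tokenizer_json):
    """Return every vocabulary token containing at least one CJK character."""
    return [t for t in vocab_tokens(tokenizer_json) if any(is_cjk(c) for c in t)]


# Example usage with a locally downloaded file (path is hypothetical):
#
#   with open("tokenizer.json", encoding="utf-8") as f:
#       print(len(cjk_tokens(json.load(f))))
```

An empty result would be consistent with the vocabulary having no dedicated pieces for Japanese text, although the exact behavior on such input also depends on how the tokenizer handles unknown characters.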

Google org

Hi @kosukekurimoto ,
You may be interested in trying BLOOMZ (https://huggingface.co/bigscience/bloomz) or mt0-xxl (https://huggingface.co/bigscience/mt0-xxl-mt). These models seem to produce consistent output in Japanese even though they were not explicitly trained on Japanese:
Screenshot 2022-11-09 at 22.22.16.png

@kosukekurimoto @qhduan Flan-T5 uses the T5 tokenizer, which is English-only. We do include multilingual and coding tasks in the Flan Collection, which plays well with multilingual models like PaLM that have appropriate tokenizers, but this of course may not overcome the limitations of the T5 tokenizer.

If you'd like to do multilingual tasks, we recommend applying FLAN tuning on top of a model with a more appropriate tokenizer, and even up-sampling the data sources you're targeting.
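One rough way to judge whether a tokenizer is "appropriate" for a target language, in the sense discussed above, is to measure how much of a sample text is mapped to the unknown token. The sketch below is only an illustration: `unk_ratio` is a hypothetical helper, and it assumes the tokenizer object exposes `encode(text, add_special_tokens=False)` and `unk_token_id`, as Hugging Face `transformers` tokenizers do.

```python
def unk_ratio(tokenizer, text):
    """Fraction of token ids for `text` that equal the tokenizer's unknown-token id.

    Returns 0.0 for input that produces no tokens.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    if not ids:
        return 0.0
    unknown = sum(1 for i in ids if i == tokenizer.unk_token_id)
    return unknown / len(ids)


# Example usage (requires the `transformers` package and network access;
# the checkpoint names are only examples):
#
#   from transformers import AutoTokenizer
#   for name in ["google/flan-t5-base", "bigscience/mt0-small"]:
#       tok = AutoTokenizer.from_pretrained(name)
#       print(name, unk_ratio(tok, "日本語に翻訳してください"))
```

A high ratio on Japanese text would suggest the tokenizer cannot represent the input well, so a different base model may be a better starting point for multilingual tuning. This is a heuristic, not a full evaluation.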
