Flan-T5 tokenizer supports neither Chinese nor many code-related tokens despite being advertised as such

#33
by michaelroyzen - opened

It seems that the Flan-T5 tokenizer can't handle Chinese (despite the model card advertising Chinese support), nor many programming-related tokens such as "{", "\n", or "\t". How can this be if Flan-T5 was truly fine-tuned on coding datasets as described in the paper? Is the publicly available tokenizer flawed, or was the entire fine-tuning procedure flawed?
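
For reference, here's a quick way to see the behavior (a minimal sketch, assuming the `transformers` library and the public `google/flan-t5-base` checkpoint):

```python
# Minimal sketch: inspect how the Flan-T5 tokenizer treats Chinese
# text and code-style characters.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

for text in ["你好，世界", "def f(x):\n\treturn {x}"]:
    tokens = tokenizer.tokenize(text)
    round_trip = tokenizer.decode(
        tokenizer(text)["input_ids"], skip_special_tokens=True
    )
    print(text, "->", tokens)
    print("round trip:", round_trip)
    # Chinese characters and "{", "\n", "\t" are not in the T5
    # SentencePiece vocabulary, so they come back as <unk> or are
    # dropped during whitespace normalization.
```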

@ybelkada

@michaelroyzen Flan-T5 uses the T5 tokenizer, which is English-only and not well suited to coding tasks. We do include multilingual and coding tasks in the Flan Collection, which play well with multilingual models and appropriate tokenizers, but of course cannot overcome the limitations of the T5 tokenizer. In limited experiments we did not see any evidence that including coding or multilingual tasks hurt Flan-T5 on English held-in and held-out evaluation tasks (not including program synthesis eval); they may even have helped by adding task diversity.

If you'd like to do multilingual or coding tasks, we recommend applying Flan tuning on top of a more appropriate model/tokenizer, and even up-sampling the data sources you're targeting.
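
For illustration, a minimal sketch of that kind of up-sampling with the `datasets` library (the file names and mixture weights below are hypothetical placeholders):

```python
# Minimal sketch: mix two instruction-tuning sets and up-sample the
# one you're targeting by giving it a larger sampling probability.
from datasets import load_dataset, interleave_datasets

flan_mix = load_dataset("json", data_files="flan_mix.jsonl", split="train")
code_mix = load_dataset("json", data_files="code_mix.jsonl", split="train")

# Give the coding data 40% of the mixture regardless of its raw size;
# tune these weights for your own use case.
train_mix = interleave_datasets(
    [flan_mix, code_mix],
    probabilities=[0.6, 0.4],
    seed=0,
    stopping_strategy="all_exhausted",
)
```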
