tokenizer newline issue and incorrect model card

#1
by ischlag - opened

1.) The model card has the same code for all tk-instruct models. There is no separate vocab file for the 11b models, so I guess it is the same as for the 3b model? It does seem to work, but it would be nice if this could be confirmed.

2.) The tokenizer, as described by the model card, ignores newlines and certain whitespace characters. I think those characters are necessary to follow the input template described in the tk-instruct paper. How do we change tokenizer.encode so that it does not ignore whitespace?
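Here is a minimal sketch of the problem (assuming the allenai/tk-instruct-3b-def checkpoint; the prompt text is just a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/tk-instruct-3b-def")

# Encode a prompt that follows the tk-instruct template, then decode it again.
prompt = "Definition: Do the task.\n\nNow complete the following example -\nInput: foo\nOutput:"
ids = tokenizer.encode(prompt)
print(tokenizer.decode(ids, skip_special_tokens=True))
# The newlines are gone: the tokenizer collapses them into single spaces.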

Regarding 2), something that has been proposed but doesn't work is:

from tokenizers import AddedToken
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n"), AddedToken("\t")]})

But this adds a whitespace character after the newline:
x = tokenizer.encode("A\nB\n\nC\t\t\tD", return_tensors="pt")
tokenizer.decode(x[0], skip_special_tokens=False) # 'A\n B\n\n C\t\t\t D'
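The extra space is SentencePiece's word-boundary marker, not something the AddedToken inserts itself; you can see it by inspecting the pieces (a sketch, using the modified tokenizer from above):

print(tokenizer.tokenize("A\nB"))
# ['▁A', '\n', '▁B']: the '▁' prefix on '▁B' is what decodes to the space after '\n'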

The newline character is in the vocab (after adding it above), so I guess the SentencePiece preprocessing is removing it.
'\n' in tokenizer.vocab.keys()  # True

OK, so I'm pretty sure now that there are no newline or tab tokens in any T5 vocab, and the template formatting as presented in the paper is being removed by the tokenizer.
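One way to confirm this is to check that a prompt with newlines and the same prompt with plain spaces tokenize identically (a sketch against a vanilla T5 tokenizer; the tk-instruct models reuse the T5 vocab, so they should behave the same):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
with_newlines = tok.tokenize("Definition: Do X.\nInput: foo\nOutput:")
without_newlines = tok.tokenize("Definition: Do X. Input: foo Output:")
print(with_newlines == without_newlines)  # True: the newline formatting carries no signal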

Allen Institute for AI org

Re 1), yeah, the vocab for the 11b and 3b models should be the same.
Re 2), \n is not in the original T5 vocab; T5 replaces it with a blank space during tokenization. When training the models, we didn't do any special processing here. I don't have a good suggestion for modifying the T5 vocab, but what you did looks reasonable to me if you can retrain the model.
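For reference, this is easy to verify against a stock T5 tokenizer (where, unlike the modified tokenizer above, the vocab check returns False):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
print("\n" in tok.get_vocab())  # False: the original SentencePiece vocab has no newline piece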

thanks for clarifying

ischlag changed discussion status to closed
