tokenizer newline issue and incorrect model card

#1
by ischlag - opened

1.) The model card has the same code for all tk-instruct models. There is no separate vocab file for the 11b models, so I guess it is the same as for the 3b model? It does seem to work, but it would be nice if this could be confirmed.

2.) The tokenizer, as described by the model card, ignores newlines and certain whitespace characters. I think those characters are necessary to follow the input template described in the tk-instruct paper. How do we change tokenizer.encode so that it does not ignore whitespace?
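Here is a minimal sketch of the problem (assuming the allenai/tk-instruct-3b-def checkpoint; the prompt text is just a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/tk-instruct-3b-def")

# Encode a prompt that follows the tk-instruct template, then decode it again.
prompt = "Definition: Do the task.\n\nNow complete the following example -\nInput: foo\nOutput:"
ids = tokenizer.encode(prompt)
print(tokenizer.decode(ids, skip_special_tokens=True))
# The newlines are gone: the tokenizer collapses them into single spaces.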

Regarding 2), something that has been proposed but doesn't work is:

from tokenizers import AddedToken
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n"), AddedToken("\t")]})

But this adds a whitespace character after the newline:
x = tokenizer.encode("A\nB\n\nC\t\t\tD", return_tensors="pt")
tokenizer.decode(x[0], skip_special_tokens=False) # 'A\n B\n\n C\t\t\t D'
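The extra space is SentencePiece's word-boundary marker, not something the AddedToken inserts itself; you can see it by inspecting the pieces (a sketch, using the modified tokenizer from above):

print(tokenizer.tokenize("A\nB"))
# ['▁A', '\n', '▁B']: the '▁' prefix on '▁B' is what decodes to the space after '\n'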

The newline character is in the vocab (after adding it above), so I guess the SentencePiece preprocessing is removing it.
'\n' in tokenizer.vocab.keys()  # True

OK, so I'm pretty sure now that there are no newline or tab tokens in any T5 vocab, and the template formatting as presented in the paper is being removed by the tokenizer.
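One way to confirm this is to check that a prompt with newlines and the same prompt with plain spaces tokenize identically (a sketch against a vanilla T5 tokenizer; the tk-instruct models reuse the T5 vocab, so they should behave the same):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
with_newlines = tok.tokenize("Definition: Do X.\nInput: foo\nOutput:")
without_newlines = tok.tokenize("Definition: Do X. Input: foo Output:")
print(with_newlines == without_newlines)  # True: the newline formatting carries no signal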

Allen Institute for AI org

Re 1), yeah, the vocab for the 11b and 3b models should be the same.
Re 2), \n is not in the original T5 vocab; T5 replaces it with a blank space during tokenization. When training the models, we didn't do any special processing here. I don't have a good suggestion for modifying the T5 vocab, but what you did looks reasonable to me if you can retrain the model.
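For reference, this is easy to verify against a stock T5 tokenizer (where, unlike the modified tokenizer above, the vocab check returns False):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
print("\n" in tok.get_vocab())  # False: the original SentencePiece vocab has no newline piece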

thanks for clarifying

ischlag changed discussion status to closed
