Issue with T5 tokenizer inconsistency

#16
by Lvegna - opened

I am encountering a weird issue where tokenizing and then decoding a substring gives a different result depending on whether it is processed alone or as part of a longer string. Take this string for instance:
"23 year old Michael ' Jos ' Surgeon was last seen at Oakland . He is described as Native American , 5'8" , and 210 lbs"
When I tokenize and decode just the substring "' Jos '", the output does not change at all and I get back "' Jos '" (with the spaces inside the single quotes preserved). However, when I tokenize and decode the whole string at once, I get back:
"23 year old Michael'Jos'Surgeon was last seen at Oakland. He is described as Native American, 5'8", and 210 lbs "
Notice that there are no longer spaces around the single quotes surrounding "Jos".

Is there any way to prevent this, or at the very least to know what a substring will be tokenized to when the whole sentence is tokenized? As a side note, I am using tokenizer.encode_plus() and loading the tokenizer with from_pretrained from flan-t5-large.
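
For anyone trying to reproduce this, here is a minimal sketch of what I suspect is happening. It assumes the difference comes from the whitespace cleanup applied at decode time rather than from tokenization itself. `approx_cleanup` is a hypothetical helper written from memory to approximate those cleanup rules; it is not the library's own code. `roundtrip` is also a hypothetical helper; it relies on the `clean_up_tokenization_spaces` argument of `decode`, and I have not confirmed that disabling it restores the T5 input exactly.

```python
def approx_cleanup(text: str) -> str:
    """Approximate the decode-time whitespace cleanup (an assumption, written from memory)."""
    rules = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"),  # only matches a quote with a space on BOTH sides
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
        (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in rules:
        text = text.replace(old, new)
    return text


def roundtrip(tokenizer, text: str, clean_up: bool = False) -> str:
    """Encode then decode, with the decode-time cleanup switched on or off."""
    ids = tokenizer.encode(text)
    return tokenizer.decode(
        ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=clean_up,
    )


full = ("23 year old Michael ' Jos ' Surgeon was last seen at Oakland . "
        "He is described as Native American , 5'8\" , and 210 lbs")

# In the full string both quotes have a space on each side, so they collapse.
print(approx_cleanup(full))

# In the bare substring the quotes sit at the string edges, so nothing matches.
print(approx_cleanup("' Jos '"))
```

To try `roundtrip` against the real tokenizer, something like `roundtrip(AutoTokenizer.from_pretrained("google/flan-t5-large"), full)` should work, assuming that is the checkpoint in use. If the spaces survive with `clean_up=False`, the cleanup step is the likely cause rather than the tokenization itself.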
