Tokenizer question

#2
by psinger - opened

Very interesting that this is the only version of phi that uses a different tokenizer, one based on tiktoken.
Is there a specific reason for that, or is it just a result of experimentation?

One issue that I quickly encountered is that the custom tokenizing code returns bytes instead of strings, unlike other tokenizers on HF:
[b'this', b' is', b' a', b' test']

This breaks a few downstream applications.

Microsoft org

This is mostly from experimentation. We found tiktoken's larger vocab to be better for multilingual performance in our preliminary experiments.

Returning bytes is expected, since tiktoken operates on raw bytes rather than strings (i.e., certain tokens in the vocab might not be valid UTF-8 strings on their own). So b"".join(tokens).decode("utf-8") would give a valid string, but "".join(t.decode("utf-8") for t in tokens) might throw a Unicode decoding error. Unfortunately, because of this, returning a string for each token id might cause issues. On the flip side, this allows for more faithful encoding of mojibake and such.
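To make the difference concrete, here is a minimal sketch in plain Python. The token list is hand-made for illustration and is an assumption, not actual output of this model's tokenizer: it mimics a byte-level vocab that splits the two UTF-8 bytes of "é" into separate tokens. The `decode_each` helper is likewise hypothetical.

```python
# Hand-made byte tokens for illustration only; NOT real tokenizer output.
# "é" is two bytes in UTF-8 (0xC3 0xA9), and a byte-level vocab may split them.
tokens = [b"caf", b"\xc3", b"\xa9", b" test"]

# Concatenate the bytes first, then decode once: works for valid text.
joined = b"".join(tokens).decode("utf-8")


def decode_each(byte_tokens):
    """Decode token by token; raises UnicodeDecodeError on a partial character."""
    return [t.decode("utf-8") for t in byte_tokens]


try:
    decode_each(tokens)
    per_token_ok = True
except UnicodeDecodeError:
    # b"\xc3" alone is an incomplete UTF-8 sequence.
    per_token_ok = False

# Lossy fallback, suitable for display only: partial characters become U+FFFD.
display = [t.decode("utf-8", errors="replace") for t in tokens]

print(joined, per_token_ok, display)
```

A downstream app that only needs something printable per token could use the `errors="replace"` form, with the caveat that it is lossy: joining those strings back together does not necessarily reproduce the original text.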

I see, thanks for the quick reply.

psinger changed discussion status to closed
