microsoft/Phi-3-small-8k-instruct

May 21, 2024 • edited May 21, 2024

Very interesting that this is the only version of Phi that uses a different tokenizer, one based on tiktoken.
Is there any specific reason for that or just a result of experimentation?

One issue I quickly ran into is that the custom tokenizing code returns bytes instead of strings, unlike other tokenizers on HF:
[b'this', b' is', b' a', b' test']

This breaks a few downstream applications.

bapatra

Microsoft org May 21, 2024

This is mostly from experimentation. We found tiktoken's larger vocab to be better for multilingual performance in our preliminary experiments.

Returning bytes is expected, since tiktoken operates on raw bytes rather than strings (i.e., certain tokens in the vocab might not be valid UTF-8 strings on their own). So concat(decoded_tokens).decode("utf-8") would be a valid string, but concat(map(lambda x: x.decode("utf-8"), decoded_tokens)) might throw a Unicode decoding error. Unfortunately, because of this, returning a string for each token id could cause issues. On the flip side, this allows for more faithful encoding of mojibake and such.
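To make the difference concrete, here is a minimal sketch in plain Python. The token pieces are hand-written for illustration and are an assumption, not actual entries from the Phi-3-small vocab; the helper names are hypothetical as well. It only shows why joining the bytes first is safe while decoding each piece separately may not be.

```python
# Hypothetical token pieces, hand-written for illustration only.
# The emoji "🙂" is four UTF-8 bytes (f0 9f 99 82), split here across two tokens.
tokens = [b"hi", b" ", b"\xf0\x9f", b"\x99\x82"]


def join_then_decode(pieces):
    """Concatenate the raw bytes first, then decode once."""
    return b"".join(pieces).decode("utf-8")


def decode_each(pieces):
    """Decode every piece on its own; raises if a piece is not valid UTF-8."""
    return [p.decode("utf-8") for p in pieces]


def decode_each_lossy(pieces):
    """Display-only fallback: invalid bytes become U+FFFD replacement characters."""
    return [p.decode("utf-8", errors="replace") for p in pieces]


text = join_then_decode(tokens)

try:
    decode_each(tokens)
    per_token_ok = True
except UnicodeDecodeError:
    # The piece b"\xf0\x9f" is only the first half of a multi-byte character.
    per_token_ok = False

lossy = decode_each_lossy(tokens)
```

A downstream app that only needs strings for display could use something like the lossy variant, with the caveat that the result no longer round-trips to the original bytes.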

psinger

May 21, 2024

I see, thanks for the quick reply.

psinger changed discussion status to closed May 21, 2024

Tokenizer question