Trouble with Phi2 Tokenisation

#116
by riedgar-ms - opened

I'm trying to get Phi2 working with the guidance library, and I'm encountering problems when the prompt contains more 'complex' characters. This appears to be due to entries being missing from the tokeniser.

Consider the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

target_model = "microsoft/phi-2"

model = AutoModelForCausalLM.from_pretrained(
    target_model, torch_dtype=torch.float32, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(target_model, trust_remote_code=True)

prompt = "some string"

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
outs = model(**inputs)
print(len(tokenizer.get_vocab()))
print(outs.logits.shape)

This prints out

50295
torch.Size([1, 2, 51200])

showing that the output logits have more entries than the tokeniser's vocabulary.

If I set target_model = "gpt2" in the program above, the output is:

50257
torch.Size([1, 2, 50257])

so for gpt2 the final dimension of the logits matches the size of the tokeniser's vocabulary.
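A possible stopgap for the size mismatch is to ignore the surplus logit positions before picking a token. The sketch below uses toy numbers and plain Python lists so that it runs without downloading a model. It assumes that the tokeniser's ids are contiguous from 0 and that the extra logit positions are unused padding, which I have not confirmed for Phi2. The helper names mask_out_of_vocab and argmax are hypothetical.

```python
import math


def mask_out_of_vocab(logits, vocab_size):
    """Return a copy of `logits` with every id >= vocab_size set to -inf.

    Assumption: token ids run contiguously from 0 to vocab_size - 1, so any
    higher index has no corresponding tokeniser entry.
    """
    return [x if i < vocab_size else -math.inf for i, x in enumerate(logits)]


def argmax(values):
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy numbers: a "model" with 8 output slots but a tokeniser with only 5 ids.
logits = [0.1, 2.0, -1.0, 0.5, 0.3, 9.0, 7.5, 8.2]
masked = mask_out_of_vocab(logits, vocab_size=5)

print(argmax(logits))  # 5, an id the tokeniser cannot decode
print(argmax(masked))  # 1, the best id the tokeniser does know
```

With real model output the same idea would apply to the last dimension of outs.logits, using len(tokenizer.get_vocab()) as the cut-off.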

I asked one of the developers of guidance, who replied:

The Phi2 tokenizer does not have the byte_decoder attribute, which means that tokens which are not valid strings on their own do not get into the vocab correctly (they just become the '�' string).
For example:
tokenizer.convert_tokens_to_string([tokenizer.convert_ids_to_tokens(447)]) just gives '�', but 447 is part of what the apostrophe is encoded as, so it must be a prefix of the apostrophe's UTF-8 bytes.
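To illustrate what the reply describes, here is a minimal sketch of a GPT-2 style byte-to-unicode table and its inverse, which is what a byte_decoder provides. It assumes that Phi2's tokeniser uses the same byte-level scheme as GPT-2. The table is rebuilt by hand rather than read from the tokeniser, and the names token_to_bytes, full_token_text and prefix are hypothetical. It shows that the first two bytes of a typographic apostrophe are recoverable as bytes, but are not a valid string on their own.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style table mapping each byte to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted past code point 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}  # the missing piece


def token_to_bytes(token):
    """Map a byte-level token string back to the raw bytes it stands for."""
    return bytes(byte_decoder[ch] for ch in token)


apostrophe = "\u2019"  # right single quotation mark, three bytes in UTF-8
full_token_text = "".join(byte_encoder[b] for b in apostrophe.encode("utf-8"))
prefix = full_token_text[:2]  # stands in for a token covering only two bytes

raw = token_to_bytes(prefix)
print(raw)                                    # b'\xe2\x80'
print(raw.decode("utf-8", errors="replace"))  # replacement character only
print(token_to_bytes(full_token_text).decode("utf-8"))  # the full apostrophe
```

Working with the raw bytes of each token, rather than its decoded string, keeps partial characters like this distinguishable from one another.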

Is there a way to work around this problem?

Has anyone else encountered a similar problem?
