Tokenizer's vocab size and config.json's vocab_size mismatch!

#43
by Yhyu13 - opened

Hi,

When running inference with vLLM, I encountered this error: https://github.com/vllm-project/vllm/issues/340

I found that vocab_size in config.json is 52100.

But checking tokenizer_config.json, the max token id is 50294.
And counting the number of tokens in the vocab.json file, there are only 50257 tokens.
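Here is a minimal script to reproduce these numbers (assuming the standard transformers API; microsoft/phi-2 is taken to be the repo this discussion is attached to):

```python
from transformers import AutoConfig, AutoTokenizer

repo = "microsoft/phi-2"  # assumed repo id for this discussion

# trust_remote_code=True may be needed on older transformers versions
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

print("config.json vocab_size:", config.vocab_size)
# get_vocab() returns the base vocab plus any added special tokens
vocab = tokenizer.get_vocab()
print("max token id:", max(vocab.values()))
print("total tokens (incl. added):", len(vocab))
```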

I worked around the vLLM sampler error mentioned above by lowering vocab_size from 52100 to 50257 in config.json.
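For completeness, that workaround is just a one-field edit; a sketch of doing it programmatically (the local path is hypothetical, adjust to wherever the model is downloaded):

```python
import json

path = "phi-2/config.json"  # hypothetical local path to the downloaded model

with open(path) as f:
    cfg = json.load(f)

cfg["vocab_size"] = 50257  # the workaround described above
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```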

Could anyone explain which is the correct number of tokens to use?

Thanks!

Microsoft org

Using self.sampler = Sampler(config.tokenizer_vocab_size) will use the correct number of tokens we used to train Phi-2 (50295, from 0 to 50294).
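To illustrate why sampling should see the trained vocab size rather than the padded one, here is a hedged sketch (the 52100 padded size and the 50295 trained size come from this thread; masking the padded logit positions is just one way to guard against sampling out-of-vocab ids, not necessarily what vLLM does internally):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")  # assumed repo id
trained_vocab = max(tokenizer.get_vocab().values()) + 1       # 50295, ids 0..50294

# The LM head can be padded past the trained vocab (52100 in config.json),
# so raw logits contain positions that never correspond to a real token.
logits = torch.randn(1, 52100)             # stand-in for model output
logits[:, trained_vocab:] = float("-inf")  # never sample padded positions
next_id = torch.argmax(logits, dim=-1).item()
assert next_id < trained_vocab
```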

gugarosa changed discussion status to closed
