Tokenizer's vocab size and config.json's vocab_size mismatch!

#43
by Yhyu13 - opened

Hi,

When running inference with vLLM, I encountered this error: https://github.com/vllm-project/vllm/issues/340

I found that vocab_size in config.json is 52100.

But checking tokenizer_config.json, the max token id is 50294.
And counting the number of tokens in the vocab.json file, there are only 50257 tokens.
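Here is a minimal script to reproduce these numbers (assuming the standard transformers API; microsoft/phi-2 is taken to be the repo this discussion is attached to):

```python
from transformers import AutoConfig, AutoTokenizer

repo = "microsoft/phi-2"  # assumed repo id for this discussion

# trust_remote_code=True may be needed on older transformers versions
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

print("config.json vocab_size:", config.vocab_size)
# get_vocab() returns the base vocab plus any added special tokens
vocab = tokenizer.get_vocab()
print("max token id:", max(vocab.values()))
print("total tokens (incl. added):", len(vocab))
```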

I worked around the vLLM sampler error mentioned above by lowering vocab_size from 52100 to 50257 in config.json.
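For completeness, that workaround is just a one-field edit; a sketch of doing it programmatically (the local path is hypothetical, adjust to wherever the model is downloaded):

```python
import json

path = "phi-2/config.json"  # hypothetical local path to the downloaded model

with open(path) as f:
    cfg = json.load(f)

cfg["vocab_size"] = 50257  # the workaround described above
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```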

Could anyone explain which is the correct number of tokens to use?

Thanks!

Microsoft org

Using self.sampler = Sampler(config.tokenizer_vocab_size) will use the correct number of tokens we used to train Phi-2 (50295, from 0 to 50294).
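To illustrate why sampling should see the trained vocab size rather than the padded one, here is a hedged sketch (the 52100 padded size and the 50295 trained size come from this thread; masking the padded logit positions is just one way to guard against sampling out-of-vocab ids, not necessarily what vLLM does internally):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")  # assumed repo id
trained_vocab = max(tokenizer.get_vocab().values()) + 1       # 50295, ids 0..50294

# The LM head can be padded past the trained vocab (52100 in config.json),
# so raw logits contain positions that never correspond to a real token.
logits = torch.randn(1, 52100)             # stand-in for model output
logits[:, trained_vocab:] = float("-inf")  # never sample padded positions
next_id = torch.argmax(logits, dim=-1).item()
assert next_id < trained_vocab
```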

gugarosa changed discussion status to closed
