Tokenizer's vocab size and config.json's vocab_size mismatch!
#43
by Yhyu13 · opened
Hi,
When running inference with vLLM, I encountered this error: https://github.com/vllm-project/vllm/issues/340
I found that `vocab_size` in config.json is 52100. But checking tokenizer_config.json, the max token id is 50294, and counting the tokens in the vocab.json file, there are only 50257 tokens.
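For reference, here is a small sketch that reproduces those three numbers with `transformers` (the repo id `microsoft/phi-2` is my assumption for this discussion; `trust_remote_code=True` may be needed on older `transformers` versions):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed repo id

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

print("config.vocab_size:", config.vocab_size)                    # 52100 (padded)
print("base vocab size:  ", tokenizer.vocab_size)                 # 50257 (vocab.json)
print("max token id:     ", max(tokenizer.get_vocab().values()))  # 50294
print("len(tokenizer):   ", len(tokenizer))                       # 50295 (ids 0..50294)
```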
I worked around the vLLM sampler error mentioned above by lowering `vocab_size` from 52100 to 50257 in the config.json file.
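The edit amounts to something like this (the local path is hypothetical, and per the reply below 50295 may be the better value):

```python
import json

path = "phi-2/config.json"  # hypothetical path to a local model snapshot

with open(path) as f:
    cfg = json.load(f)

cfg["vocab_size"] = 50257  # was 52100
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```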
Could anyone explain which is the correct number of tokens to use?

Thanks!
Using `self.sampler = Sampler(config.tokenizer_vocab_size)` will use the correct number of tokens we used to train Phi-2 (50295, from 0 to 50294).
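To illustrate the idea (this is not the actual vLLM `Sampler`; the shapes and the 52100/50295 numbers are taken from the discussion above): sample only over the ids the tokenizer defines, even though the logits matrix stays padded.

```python
import torch

def sample_next_token(logits: torch.Tensor, tokenizer_vocab_size: int) -> torch.Tensor:
    """Sample from logits whose width is the padded vocab (e.g. 52100),
    restricted to the ids the tokenizer defines (0..tokenizer_vocab_size - 1)."""
    # Columns >= tokenizer_vocab_size were never trained, so their logits are
    # meaningless and can produce token ids the tokenizer cannot decode.
    valid_logits = logits[..., :tokenizer_vocab_size]
    probs = torch.softmax(valid_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example: batch of 2, logits padded to 52100 columns, true vocab of 50295.
logits = torch.randn(2, 52100)
next_ids = sample_next_token(logits, tokenizer_vocab_size=50295)
assert int(next_ids.max()) < 50295  # sampled ids are always decodable
```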
gugarosa changed discussion status to closed