Why does the 72B model have a different vocab size compared with other models?

Opened by Mikasaka

I noticed that this 72B model has a vocab size of 152064, while the other models (7B, 4B, etc.) have a vocab size of 151936. Why is it designed this way?

I have a similar question. For Qwen 1.8B the documentation says the vocab size is 151851, and the tokenizer indeed has 151851 tokens, but in the model weights vocab_size is 151936. Can someone explain why that is? Thanks.
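For anyone who wants to see the discrepancy themselves, here is a minimal sketch that compares the tokenizer's token count with the vocab_size stored in each model config. The model IDs are placeholders; substitute whichever Qwen checkpoints you are using.

```python
# Minimal sketch: compare the tokenizer's token count with the config's
# vocab_size. The model IDs below are placeholders, not a confirmed list.
from transformers import AutoConfig, AutoTokenizer

for model_id in ("Qwen/Qwen-1_8B", "Qwen/Qwen-72B"):
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    # len(tokenizer) counts the tokens the tokenizer actually defines;
    # config.vocab_size is the (possibly padded) size of the embedding matrix.
    print(f"{model_id}: tokenizer={len(tokenizer)}, config.vocab_size={config.vocab_size}")
```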

Qwen org

The vocabularies are actually the same. The different vocab sizes come from our distributed training: for the larger models trained across more devices, the vocabulary is padded so that the embedding matrix splits evenly across them.
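As an illustration of how such padding typically works, here is a Megatron-style sketch. The padding multiples below are assumptions chosen to reproduce the observed numbers, not values confirmed by the Qwen team.

```python
# Sketch of Megatron-style vocab padding: round the true vocabulary up so the
# embedding matrix divides evenly across tensor-parallel ranks. The multiples
# below are illustrative assumptions, not confirmed Qwen training settings.
def pad_vocab_size(true_vocab_size: int, multiple: int) -> int:
    """Round true_vocab_size up to the nearest multiple."""
    return ((true_vocab_size + multiple - 1) // multiple) * multiple

true_vocab = 151851                     # tokens actually defined by the tokenizer
print(pad_vocab_size(true_vocab, 128))  # 151936, the size seen in the smaller models
print(pad_vocab_size(true_vocab, 512))  # 152064, the size seen in the 72B model
```

The padding rows correspond to token IDs the tokenizer never emits, so they only affect the shape of the embedding and output layers, not generation.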

jklj077 changed discussion status to closed

The problem is that vLLM checks the vocab sizes, and if they don't match, speculative decoding is not enabled. If you are going to pad, perhaps pad all models to the same vocab size.
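A minimal sketch (not vLLM's actual internal check, and with placeholder model IDs) of verifying that a draft and target model report the same vocab_size before trying speculative decoding:

```python
# Minimal sketch: check whether a target/draft pair report the same vocab_size,
# since vLLM will not enable speculative decoding if they differ.
# Model IDs are placeholders.
from transformers import AutoConfig

target_id = "Qwen/Qwen-72B"   # placeholder target model
draft_id = "Qwen/Qwen-1_8B"   # placeholder draft model

target_vocab = AutoConfig.from_pretrained(target_id, trust_remote_code=True).vocab_size
draft_vocab = AutoConfig.from_pretrained(draft_id, trust_remote_code=True).vocab_size

if target_vocab != draft_vocab:
    print(f"vocab mismatch: target={target_vocab}, draft={draft_vocab}")  # e.g. 152064 vs 151936
else:
    print(f"vocab sizes match: {target_vocab}")
```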
