Discrepancy in vocab size

#1
by richardlian - opened

In config.json, the vocab size is listed as 61952; however, the vocab_size attribute on the tokenizer object reports 61873. What is the reason for this discrepancy? Is it okay if I change config.json to match the tokenizer's vocab size?
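For reference, this is roughly how I'm seeing the mismatch (a minimal sketch; the repo id below is a placeholder for this model):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "MediaTek-Research/<model-name>"  # placeholder repo id

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.vocab_size)     # 61952 (from config.json)
print(tokenizer.vocab_size)  # 61873 (base vocab reported by the tokenizer)
print(len(tokenizer))        # base vocab plus any added special tokens
```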

MediaTek Research org
edited Mar 31

Hi Richard,

The reason is that the vocabulary is padded to meet tensor-parallelism requirements (so the embedding matrix splits evenly across devices). The last rows of the embedding matrix are not used, and the effective vocab size is 61873. You are allowed to change config.json, but I believe you would then also need to resize the model's embedding matrix to match.
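If you do go that route, something along these lines should work (just a sketch, not tested against this exact checkpoint; the repo id is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MediaTek-Research/<model-name>"  # placeholder repo id

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Shrink the padded embedding matrix down to the tokenizer's effective size.
# resize_token_embeddings also updates config.vocab_size to the new value.
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("model-unpadded")
tokenizer.save_pretrained("model-unpadded")
```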

Best,
Jeff

Thanks for the reply!

I'm trying to get the tokenizer's vocab size to match config.json, so I want to add extra padding tokens. The reason I'm asking is that NeMo's conversion script for turning a Transformers model into their .nemo format checks that the vocab sizes are consistent, and I'd rather not disable any of the checks the script makes.
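Concretely, I was thinking of something like this to pad the tokenizer up to the config's size (a rough sketch; the placeholder token names and repo id are made up, and note that tokenizer.vocab_size only counts the base vocab, so it's len(tokenizer) that grows):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "MediaTek-Research/<model-name>"  # placeholder repo id

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Add dummy tokens until the tokenizer covers the padded size in config.json.
num_missing = config.vocab_size - len(tokenizer)
if num_missing > 0:
    tokenizer.add_tokens([f"<pad_extra_{i}>" for i in range(num_missing)])

assert len(tokenizer) == config.vocab_size
tokenizer.save_pretrained("tokenizer-padded")
```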
