Vocab size of tokenizer is not equal to vocab size of the model

#2
by dinhanhx - opened

I noticed that in config.json the model's vocab size is 256512, while the tokenizer's vocab size is 256000.


dinhanhx changed discussion title from Vocab size is 256512 or 25600? to Vocab size of tokenizer is not equal to vocab size of the model

The vocab size is set to 256512 in the original gin file. That matches the dimensions of the weights.

I believe the difference is due to EXTRA_IDS = 512, also set in the gin file. You can fine-tune those extra tokens for your own needs.
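A quick sanity check of the arithmetic above (a sketch, assuming the T5-style convention of appending sentinel/extra tokens after the regular tokenizer vocabulary):

```python
# Reconciling the two vocab sizes from the discussion:
# the model's embedding table covers the tokenizer vocabulary
# plus the EXTRA_IDS sentinel slots from the gin file.
tokenizer_vocab_size = 256000  # vocab size reported by the tokenizer
extra_ids = 512                # EXTRA_IDS from the original gin file

model_vocab_size = tokenizer_vocab_size + extra_ids
print(model_vocab_size)  # 256512, matching config.json and the weight shapes
```

So the checkpoint's embedding matrix is simply sized for the tokenizer vocabulary plus the reserved extra-ID slots; nothing is missing or truncated.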

jbochi changed discussion status to closed

Ah, I see. Thanks for clarifying!
