Why is config.vocab_size != tokenizer.vocab_size?

#18
by Qubitium - opened

@abhi-db @hanlintang @srowen Why is there a huge discrepancy between the model's config.vocab_size (100352) and the actual tokenizer.vocab_size (100277)? This is very strange.
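For reference, a minimal sketch of how the two values can be compared; the model id "databricks/dbrx-base" is an assumption here and should be swapped for the checkpoint in question:

```python
# Sketch: compare the config's embedding vocab size with the tokenizer's vocab size.
from transformers import AutoConfig, AutoTokenizer

model_id = "databricks/dbrx-base"  # assumption: substitute the actual checkpoint
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

print(config.vocab_size)     # e.g. 100352 (size of the embedding matrix)
print(tokenizer.vocab_size)  # e.g. 100277 (tokens the tokenizer actually defines)
```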

Databricks org

The model vocab size is padded to a larger value to 1) improve matmul efficiency and 2) leave space for extra tokens in case folks would like to finetune the model with special token ids.
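A rough sketch of the padding arithmetic, assuming the embedding matrix is rounded up to a multiple of 128 (the exact block size is an assumption, but 100277 rounded up to a multiple of 128 does give 100352):

```python
# Sketch of the padding idea: size the embedding matrix to the next multiple of a
# hardware-friendly block, so the unused rows leave headroom for special tokens
# added later during finetuning.
def padded_vocab_size(tokenizer_vocab: int, multiple: int = 128) -> int:
    # Round up to the nearest multiple of `multiple`.
    return ((tokenizer_vocab + multiple - 1) // multiple) * multiple

print(padded_vocab_size(100277))  # 100352 -> matches config.vocab_size
```

In practice this means new special tokens can be added to the tokenizer during finetuning without growing the embedding matrix, as long as the total stays within the padded size.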

abhi-db changed discussion status to closed
