Why is config.vocab_size != tokenizer.vocab_size?

#18
by Qubitium - opened

@abhi-db @hanlintang @srowen Why is there a huge discrepancy between the model's config.vocab_size (100352) and the actual tokenizer.vocab_size (100277)? This is very strange.
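For reference, a minimal sketch of how the two values can be compared; the model id "databricks/dbrx-base" is an assumption here and should be swapped for the checkpoint in question:

```python
# Sketch: compare the config's embedding vocab size with the tokenizer's vocab size.
from transformers import AutoConfig, AutoTokenizer

model_id = "databricks/dbrx-base"  # assumption: substitute the actual checkpoint
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

print(config.vocab_size)     # e.g. 100352 (size of the embedding matrix)
print(tokenizer.vocab_size)  # e.g. 100277 (tokens the tokenizer actually defines)
```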

Databricks org

The model vocab size is padded to a larger value to 1) improve matmul efficiency and 2) leave space for extra tokens in case folks would like to finetune the model with special token ids.
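A rough sketch of the padding arithmetic, assuming the embedding matrix is rounded up to a multiple of 128 (the exact block size is an assumption, but 100277 rounded up to a multiple of 128 does give 100352):

```python
# Sketch of the padding idea: size the embedding matrix to the next multiple of a
# hardware-friendly block, so the unused rows leave headroom for special tokens
# added later during finetuning.
def padded_vocab_size(tokenizer_vocab: int, multiple: int = 128) -> int:
    # Round up to the nearest multiple of `multiple`.
    return ((tokenizer_vocab + multiple - 1) // multiple) * multiple

print(padded_vocab_size(100277))  # 100352 -> matches config.vocab_size
```

In practice this means new special tokens can be added to the tokenizer during finetuning without growing the embedding matrix, as long as the total stays within the padded size.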

abhi-db changed discussion status to closed
