`vocab_size` mismatch?

#120
by VictorSanh - opened
BigScience Workshop org

Hey!

The model card says "A vocabulary size of 250,680", and `len(tokenizer)` returns 250680.

However, the config has `"vocab_size": 250880`.

Also, the docstring of `BloomConfig` still has `vocab_size (int, *optional*, defaults to 50257):`, which I believe was copied over from GPT-2.

Is there a reason for these mismatches?
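For reference, here is a minimal sketch to reproduce the numbers above (assuming the `bigscience/bloom` checkpoint on the Hub; any BLOOM variant should show the same gap):

```python
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
config = AutoConfig.from_pretrained("bigscience/bloom")

print(len(tokenizer))      # 250680 -> tokens actually known by the tokenizer
print(config.vocab_size)   # 250880 -> size of the word_embeddings / lm_head matrices
```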

BigScience Workshop org
edited Oct 4, 2022

And at the same time, the `word_embeddings` matrix is of size `Embedding(250880, hidden_size)`. I am missing something 😅

BigScience Workshop org

Hello!

There is indeed an explanation for this difference in numbers. In the config.json file, the `vocab_size` variable is only used to define the size of the `word_embeddings` and `lm_head` matrices. The constraint is that the size of these matrices must be greater than or equal to the number of tokens known by the tokenizer, with the difference of 200 corresponding to "dummy" tokens that are never used.

There are several reasons in the development of BLOOM that led to this difference. The size of the `word_embeddings` and `lm_head` matrices had to be divisible by a certain number (4 * 128, if I remember correctly) so that the model could be parallelized with tensor parallelism. In addition, the tokenizer was produced before the model design was finalized, so it was safer to leave some tokens available in case we needed to add special tokens for training (for PII, for example).
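To make the arithmetic concrete, here is a small sketch of that padding logic (the multiple of 4 * 128 = 512 and the `pad_vocab_size` helper are my assumptions for illustration, not taken from the training code):

```python
def pad_vocab_size(n_tokens: int, multiple: int = 4 * 128) -> int:
    """Round n_tokens up to the next multiple of `multiple` (hypothetical helper)."""
    return ((n_tokens + multiple - 1) // multiple) * multiple

tokenizer_vocab = 250_680              # len(tokenizer)
padded = pad_vocab_size(tokenizer_vocab)
print(padded)                          # 250880 -> matches config.vocab_size
print(padded - tokenizer_vocab)        # 200 unused "dummy" tokens
```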

BigScience Workshop org

Thanks @VictorSanh & @SaulLu for the explanation!
I agree that the docstring of `BloomConfig` is slightly confusing; I propose to address this in https://github.com/huggingface/transformers/pull/19336/files !

BigScience Workshop org

Thank you for the explanation and the PR @SaulLu & @ybelkada !
I understand now, closing this.

VictorSanh changed discussion status to closed
