Question about vocab size

#1
by johnsongwx

Hi! This is great work and I am trying to follow it.
When applying the model in another framework, I discovered that the vocab_size in the config.json file is 30522, while the vocab.txt file contains only 28895 lines (words). Shouldn't these two numbers be the same? Or am I misunderstanding something?
Looking forward to your reply. Thanks a lot!

Microsoft org
• edited Jun 11, 2022

Thanks for your comment. When checking the size of the embedding matrix (cc @nbroad):

from transformers import BertModel

model = BertModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
print(model.embeddings.word_embeddings.weight.shape)

it does print a shape of torch.Size([30522, 768]).
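For reference, the tokenizer side can be checked the same way. A quick sketch, assuming the tokenizer is loaded from the same repo; its vocab_size reflects the entries read from vocab.txt:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

# vocab_size counts the entries loaded from vocab.txt
print(tokenizer.vocab_size)  # expected: 28895, matching the line count you report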

I guess that the last rows of the embedding matrix (30522 - 28895 = 1627 of them) are actually never used and could be removed from the model.

However, simply updating the vocab_size attribute of the config will raise an error, since the updated size no longer matches the size of the embedding matrix. One should therefore update the vocab_size attribute and the embedding matrix at the same time.
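For what it's worth, transformers ships a helper, resize_token_embeddings, that performs both updates together. A minimal sketch, assuming 28895 (the line count of vocab.txt above) is the intended size:

from transformers import BertModel

model = BertModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

# Truncates the embedding matrix and updates config.vocab_size in one call.
model.resize_token_embeddings(28895)

print(model.config.vocab_size)                        # 28895
print(model.embeddings.word_embeddings.weight.shape)  # torch.Size([28895, 768])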
