Vocabulary size mismatch between tokenizer and model

by ehrencrona

There seem to be more tokens defined in the tokenizer than in the model config:

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

model_id = "birgermoell/swedish-gpt"

# Load the TF model from the PyTorch weights, plus the matching tokenizer
model = TFGPT2LMHeadModel.from_pretrained(model_id, from_pt=True)
tokenizer = GPT2Tokenizer.from_pretrained(model_id)

print((tokenizer.vocab_size, model.config.vocab_size))

gives me

(50265, 50257)

Could there be some sort of version mismatch?
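
If the tokenizer's vocabulary is the authoritative one, a possible workaround (just a sketch, not tested against this checkpoint, reusing the model and tokenizer loaded above) would be to resize the model's token embeddings so they cover every tokenizer id:

# Grow the embedding matrix so every tokenizer id has a row;
# the new rows are randomly initialized, so they are only useful
# as a stopgap or as a starting point for further training.
model.resize_token_embeddings(len(tokenizer))
print(model.config.vocab_size)  # should now report 50265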

BTW, the mismatch can also be seen by entering "asha" in the inference widget, which returns an "unknown error". "asha" is one of the last words in the tokenizer's vocabulary, so its id falls outside the model's embedding table and produces an out-of-range index somewhere.
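
To confirm, one can check which ids the tokenizer assigns to that input (a minimal sketch, reusing the tokenizer loaded above; the exact ids are my guess based on the behaviour described):

ids = tokenizer.encode("asha")
print(ids)                           # expect at least one id >= 50257
print(any(i >= 50257 for i in ids))  # True -> the embedding lookup goes out of range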
