mt5-base embedding size and tokenizer size don't match?

#2
by echau18 - opened

Hello! I'm trying to load mt5-base (encoder-only) with transformers, and I'm finding that the config and checkpoint have more input embeddings than there are items in the tokenizer's vocabulary. Specifically:

from transformers import MT5TokenizerFast, MT5Config, MT5EncoderModel

# Load config, tokenizer, and encoder-only model from the same checkpoint.
cfg = MT5Config.from_pretrained("google/mt5-base")
tok = MT5TokenizerFast.from_pretrained("google/mt5-base")
mdl = MT5EncoderModel.from_pretrained("google/mt5-base", config=cfg)

# The embedding matrix agrees with the config, but the tokenizer does not.
print(cfg.vocab_size == mdl.get_input_embeddings().num_embeddings)
print(cfg.vocab_size == len(tok))
print(cfg.vocab_size)
print(len(tok))

prints

True
False
250112
250100

(this happens with both fast and standard tokenizers)
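For completeness, a minimal sketch of the same check with the slow tokenizer, reusing cfg from above:

from transformers import MT5Tokenizer

# Slow (SentencePiece-based) tokenizer; same mismatch as with the fast one.
slow_tok = MT5Tokenizer.from_pretrained("google/mt5-base")
print(cfg.vocab_size == len(slow_tok))  # False here as well
print(len(slow_tok))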

Is this expected? If so, what are the extra 12 tokens for?
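In case it's useful, here's a small follow-up that makes the gap concrete (reusing cfg, tok, and mdl from above): the ids the tokenizer can emit stop at len(tok) - 1, so the last 12 embedding rows are never produced by the tokenizer.

emb = mdl.get_input_embeddings()
print(emb.weight.shape)                       # [cfg.vocab_size, d_model], i.e. 250112 rows

# The 12 ids in question have embedding rows but no tokenizer entries.
print(list(range(len(tok), cfg.vocab_size)))  # 250100 .. 250111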

Thanks @patrickvonplaten! Seems like this is a common issue that keeps resurfacing. Any chance a note could be added directly to the model card as an FYI, rather than routing people through the transformers repo?
