mt5-base embedding size and tokenizer size don't match?

#2
by echau18 - opened

Hello! I'm trying to load mt5-base (encoder-only) with transformers, and I'm finding that the config and checkpoint have more input embeddings than there are items in the tokenizer's vocabulary. Specifically:

from transformers import MT5TokenizerFast, MT5Config, MT5EncoderModel

# Load config, tokenizer, and encoder-only model from the same checkpoint.
cfg = MT5Config.from_pretrained("google/mt5-base")
tok = MT5TokenizerFast.from_pretrained("google/mt5-base")
mdl = MT5EncoderModel.from_pretrained("google/mt5-base", config=cfg)

# The embedding matrix agrees with the config, but the tokenizer does not.
print(cfg.vocab_size == mdl.get_input_embeddings().num_embeddings)
print(cfg.vocab_size == len(tok))
print(cfg.vocab_size)
print(len(tok))

prints

True
False
250112
250100

(this happens with both fast and standard tokenizers)
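For completeness, a minimal sketch of the same check with the slow tokenizer, reusing cfg from above:

from transformers import MT5Tokenizer

# Slow (SentencePiece-based) tokenizer; same mismatch as with the fast one.
slow_tok = MT5Tokenizer.from_pretrained("google/mt5-base")
print(cfg.vocab_size == len(slow_tok))  # False here as well
print(len(slow_tok))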

Is this expected? If so, what are the extra 12 tokens for?
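In case it's useful, here's a small follow-up that makes the gap concrete (reusing cfg, tok, and mdl from above): the ids the tokenizer can emit stop at len(tok) - 1, so the last 12 embedding rows are never produced by the tokenizer.

emb = mdl.get_input_embeddings()
print(emb.weight.shape)                       # [cfg.vocab_size, d_model], i.e. 250112 rows

# The 12 ids in question have embedding rows but no tokenizer entries.
print(list(range(len(tok), cfg.vocab_size)))  # 250100 .. 250111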

Thanks @patrickvonplaten! Seems like this is a common issue that keeps resurfacing. Any chance a note could be added directly to the model card as an FYI, rather than routing people through the transformers repo?
