More Logits Than Tokens in Vocab

#4
by calbors - opened

The following snippet fails with an AssertionError:

from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "LongSafari/hyenadna-large-1m-seqlen-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
assert model.vocab_size == len(tokenizer.get_vocab())

Was a different vocab used during training?
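
One possible explanation (an assumption, not confirmed for this checkpoint): model vocabularies are often padded up to a multiple of 8 so the LM head matmul is more hardware-friendly, which leaves the model with more logits than the tokenizer has tokens. A minimal sketch of that rounding, with `padded_vocab_size` being a hypothetical helper:

```python
def padded_vocab_size(n_tokens: int, pad_to: int = 8) -> int:
    # Round n_tokens up to the nearest multiple of pad_to.
    return ((n_tokens + pad_to - 1) // pad_to) * pad_to

# e.g. a 12-token tokenizer would yield a 16-logit output head
print(padded_vocab_size(12))  # 16
```

If that is what happened here, the extra logit positions simply correspond to unused token ids and are never produced by the tokenizer.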
