Different size between tokenizer vocab and embedding
#1 by demharters - opened
There seems to be a discrepancy between vocab length and embedding size. Any ideas why?
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = 'qilowoq/AbLang_heavy'
tokenizer = AutoTokenizer.from_pretrained(model_name, revision='c451857')
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True, revision='c451857')

embedding_size = model.roberta.embeddings.word_embeddings.weight.size(0)
print(f"Embedding size: {embedding_size}")
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")
```
"Embedding size: 24
Vocabulary size: 25"
Yes, that was intentional.
The tokenizer needs an [UNK] token, but the original model had no such token, so [UNK] was added as the 25th token. It does not affect the model unless an unknown amino acid appears in the sequence.
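If you want to confirm which id sits outside the embedding matrix, here is a quick check (continuing from the snippet above; a sketch that assumes the tokenizer exposes [UNK] through the usual `unk_token` attribute):

```python
# Sketch: show that only [UNK] falls outside the 24-row embedding matrix.
unk_id = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
print(f"[UNK] id: {unk_id}")  # expected: the last vocabulary id
print(f"Ids covered by embeddings: 0..{embedding_size - 1}")
```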
It's just that I got an error when starting fine-tuning due to the discrepancy. Thanks for clarifying.
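For anyone hitting the same error, a possible workaround (a sketch, not something confirmed by the model author) is to resize the embedding matrix to match the tokenizer before fine-tuning; `resize_token_embeddings` is the standard `transformers` method, assuming the remote-code model class supports it:

```python
# Sketch: grow the embedding matrix so its row count matches the tokenizer.
# The new [UNK] row is randomly initialized and only matters if an unknown
# amino acid actually appears in the input.
model.resize_token_embeddings(len(tokenizer))
print(model.roberta.embeddings.word_embeddings.weight.size(0))  # now 25
```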
qilowoq changed discussion status to closed