Commit f319d91 breaks eos_token_id
#9
opened by tomer
Hi,
After the last commit, eos_token_id == vocab_size, which results in an out-of-range index into the embedding matrix.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True,
revision="f319d912c0c73ea3682094202b209ac8cb5d4cba")
print(tokenizer.eos_token, tokenizer.eos_token_id, tokenizer.vocab_size)
prints:
<|endoftext|> 51200 51200
Yeah. I have the same issue, quite frustrating.
My workaround: manually set eos_token_id to 50256, which maps to the same EOS token.
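To illustrate why the id is a problem, here is a minimal sketch (the embedding is a plain Python list standing in for the model's embedding matrix; the row width of 4 is arbitrary): an embedding matrix has valid rows 0..vocab_size-1, so a token id equal to vocab_size is out of range, while the workaround id 50256 is in range.

```python
vocab_size = 51200
# Stand-in for the model's embedding matrix: vocab_size rows, 4 columns.
embedding = [[0.0] * 4 for _ in range(vocab_size)]

bad_eos_id = 51200   # eos_token_id reported by the broken revision
good_eos_id = 50256  # workaround id, same <|endoftext|> token

def lookup(token_id):
    """Return the embedding row, or None when the id is out of range."""
    if 0 <= token_id < len(embedding):
        return embedding[token_id]
    return None

print(lookup(bad_eos_id))      # None: 51200 overflows a 51200-row matrix
print(lookup(good_eos_id)[0])  # 0.0: 50256 is a valid row
```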
Still having this issue.