eos_token_id is wrong

#2
by lise-brinck - opened

To preface, we've created a custom StoppingCriteria that stops generation when a token starting with Ġ is generated, i.e. the generation produces only one word (word, not token).
For some reason, the token "Verden" is appended to the output when using beam search together with our custom stopping criteria. It is only an issue when using beam search.
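For reference, the logic of our stopping criterion looks roughly like the sketch below. In transformers this subclasses StoppingCriteria and is passed to generate() via a StoppingCriteriaList; here it is shown with a stand-in toy tokenizer (the class names and token ids are made up) so the logic is self-contained:

```python
class StopOnNewWord:
    """Stop generation once a token beginning with "Ġ" (the new-word
    marker in byte-level BPE vocabularies) is produced, so that only
    one word is generated after the prompt."""

    def __init__(self, tokenizer, prompt_len):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip

    def __call__(self, input_ids, scores=None):
        # input_ids: token ids for one sequence (a tensor row in practice)
        generated = input_ids[self.prompt_len:]
        if not generated:
            return False
        last = self.tokenizer.convert_ids_to_tokens([generated[-1]])[0]
        # Stop once a token that starts a new word appears.
        return last.startswith("\u0120")  # "Ġ"


# Stand-in tokenizer with a tiny, invented vocabulary.
class ToyTokenizer:
    vocab = {0: "Hej", 1: "sa", 2: "Ġverden", 3: "!"}

    def convert_ids_to_tokens(self, ids):
        return [self.vocab[i] for i in ids]


tok = ToyTokenizer()
crit = StopOnNewWord(tok, prompt_len=1)
print(crit([0, 1]))     # "sa" continues the current word -> False
print(crit([0, 1, 2]))  # "Ġverden" starts a new word -> True
```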

eos_token_id is set to 50256, but in added_tokens.json, <|endoftext|> has id 50265. The model has a vocab_size of 50257, while the tokenizer has a vocab_size of 50265, and the token with id 50256 is "Verden". Maybe this is somehow causing the issue? Even when I force eos_token_id on the model, or pad_token_id in generate(), to be 50265, "Verden" is still appended to all outputs.

Edit: Looking at vocab.json, the vocab contains 50264 tokens, but in config.json, vocab_size is set to 50257. Surely this is a mistake?
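One possible explanation for why forcing eos_token_id has no effect (an assumption on my part, not confirmed): the model's output layer only covers vocab_size logits, so any token id at or above config.json's vocab_size can never be emitted. The numbers below are the ones quoted above:

```python
# Values taken from the files discussed above.
model_vocab_size = 50257      # config.json: size of the model's output layer
tokenizer_vocab_size = 50265  # tokenizer's reported vocab_size
eos_id = 50265                # <|endoftext|> id from added_tokens.json

# An id the model can emit must satisfy id < model_vocab_size.
# 50265 fails this check, so forcing eos_token_id=50265 can never
# trigger end-of-sequence: the model's logits don't cover that id.
print(eos_id < model_vocab_size)  # False

# Meanwhile, id 50256 (the configured eos_token_id) decodes in this
# tokenizer to "Verden", which would explain it showing up in outputs.
```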
