eos_token clarification

#1 opened by Starlento

I found that tokenizer_config.json defines a standard ChatML chat template, but when I tested the model it seems to use <|endoftext|> as the eos_token in some cases.
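For context, here is a minimal sketch (assuming the standard transformers API; model_path is a placeholder for this repo's model id) that prints the token the tokenizer reports as eos_token next to the id of the <|im_end|> token that the ChatML template ends turns with:

from transformers import AutoTokenizer

model_path = "..."  # fill in this repo's model id
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Token the tokenizer reports as eos_token
print(tokenizer.eos_token, tokenizer.eos_token_id)
# Id of the ChatML turn-end token used by the chat template
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))

If these two ids differ, generate() only stops on whichever id is passed as eos_token_id, which may be what is happening below.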

Inference code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "..."  # fill in this repo's model id
tokenizer = AutoTokenizer.from_pretrained(model_path)
text_model = AutoModelForCausalLM.from_pretrained(model_path).to('cuda')

messages = [
    {"role": "user", "content": "你好"}  # "Hello"
]

input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, return_tensors='pt')
output_ids = text_model.generate(input_ids.to('cuda'), eos_token_id=tokenizer.eos_token_id, max_length=256)
response = tokenizer.decode(output_ids[0], skip_special_tokens=False)
# response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Result:

<|im_start|>user
你好<|im_end|> 
<|im_start|>assistant
你好!有什么我可以帮助你的吗?<|endoftext|>你好!有什么我可以帮助你的吗?

如果你有任何问题或需要信息,请随时告诉我!我在这里帮助你。<|im_end|>

Note that the assistant's reply ("Hello! Is there anything I can help you with?") is followed by <|endoftext|> and then repeated, instead of generation stopping there. For the English prompt "hi", however, the output looks normal.

Could you kindly look into this problem?
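In the meantime, a possible workaround (a sketch, assuming a recent transformers version, which accepts a list of ids for eos_token_id) is to stop generation on either special token:

stop_ids = [
    tokenizer.convert_tokens_to_ids("<|endoftext|>"),
    tokenizer.convert_tokens_to_ids("<|im_end|>"),
]
output_ids = text_model.generate(input_ids.to('cuda'), eos_token_id=stop_ids, max_length=256)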

Same problem here.
