eos_token clarification

#1 opened by Starlento

I found that tokenizer_config.json defines a standard ChatML chat template, but when I tested the model it seems to use <|endoftext|> as the eos_token in some cases.
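For context, here is a minimal sketch (assuming the standard transformers API; model_path is a placeholder for this repo's model id) that prints the token the tokenizer reports as eos_token next to the id of the <|im_end|> token that the ChatML template ends turns with:

from transformers import AutoTokenizer

model_path = "..."  # fill in this repo's model id
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Token the tokenizer reports as eos_token
print(tokenizer.eos_token, tokenizer.eos_token_id)
# Id of the ChatML turn-end token used by the chat template
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))

If these two ids differ, generate() only stops on whichever id is passed as eos_token_id, which may be what is happening below.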

Inference code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "..."  # fill in this repo's model id
tokenizer = AutoTokenizer.from_pretrained(model_path)
text_model = AutoModelForCausalLM.from_pretrained(model_path).to('cuda')

messages = [
    {"role": "user", "content": "你好"}  # "Hello"
]

input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, return_tensors='pt')
output_ids = text_model.generate(input_ids.to('cuda'), eos_token_id=tokenizer.eos_token_id, max_length=256)
response = tokenizer.decode(output_ids[0], skip_special_tokens=False)
# response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Result:

<|im_start|>user
你好<|im_end|> 
<|im_start|>assistant
你好!有什么我可以帮助你的吗?<|endoftext|>你好!有什么我可以帮助你的吗?

如果你有任何问题或需要信息,请随时告诉我!我在这里帮助你。<|im_end|>

Note that the assistant's reply ("Hello! Is there anything I can help you with?") is followed by <|endoftext|> and then repeated, instead of generation stopping there. For the English prompt "hi", however, the output looks normal.

Could you kindly look into this problem?
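In the meantime, a possible workaround (a sketch, assuming a recent transformers version, which accepts a list of ids for eos_token_id) is to stop generation on either special token:

stop_ids = [
    tokenizer.convert_tokens_to_ids("<|endoftext|>"),
    tokenizer.convert_tokens_to_ids("<|im_end|>"),
]
output_ids = text_model.generate(input_ids.to('cuda'), eos_token_id=stop_ids, max_length=256)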

Same problem here.
