Are two BOS token ids correct?

#97
by hpsun - opened

The chat template produces prompt="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n".
Tokenizing this prompt with the current version of the tokenizer prepends another bos_token_id.
As a result, the first two positions of prompt_token_ids are both 128000 (the BOS token id). Is that correct?

I cannot reproduce. Can I see your code?

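from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Meta-Llama-3.1-405B-Instruct-FP8', use_fast=True)
# `messages` was not shown in the original post; a single user turn is assumed here.
messages = [{"role": "user", "content": "Hello!"}]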

Tokenizing directly through the chat template gives one BOS token id:

inputs_id = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
print(inputs_id)
#[128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 9906, 0, 128009, 128006, 78191, 128007, 271]

Rendering the chat template to text and then tokenizing that text gives two BOS token ids:

inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs_id = tokenizer(inputs)
print(inputs_id)
#{'input_ids': [128000, 128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 9906, 0, 128009, 128006, 78191, 128007, 271], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
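
For anyone who wants to check their own pipeline for this, here is a minimal sketch that counts the BOS ids at the start of a sequence and collapses duplicates down to one. The helper names count_leading_bos and collapse_leading_bos are hypothetical (not part of transformers), and the BOS id 128000 is simply the value reported in this thread.

```python
from typing import List

BOS_ID = 128000  # bos_token_id reported in this thread; adjust for other models


def count_leading_bos(ids: List[int], bos_id: int = BOS_ID) -> int:
    """Return how many consecutive BOS ids appear at the start of `ids`."""
    count = 0
    for token_id in ids:
        if token_id != bos_id:
            break
        count += 1
    return count


def collapse_leading_bos(ids: List[int], bos_id: int = BOS_ID) -> List[int]:
    """Return a copy of `ids` with any run of leading BOS ids reduced to one."""
    n = count_leading_bos(ids, bos_id)
    if n <= 1:
        return list(ids)
    # Keep the last BOS of the run and everything after it.
    return list(ids[n - 1:])


# The first few ids from the two outputs above.
one_bos = [128000, 128006, 9125, 128007, 271]
two_bos = [128000, 128000, 128006, 9125, 128007, 271]

print(count_leading_bos(one_bos))               # 1
print(count_leading_bos(two_bos))               # 2
print(collapse_leading_bos(two_bos) == one_bos)  # True
```

Collapsing after the fact only repairs the ids; if an attention mask was produced alongside them, it would need the same trimming.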

hpsun changed discussion status to closed

I encountered a similar issue before and resolved it by removing the bos token from the chat template.
In my experience, this didn't noticeably affect performance, but it did complicate customizing the attention/loss mask.

You can avoid the duplicate by passing add_special_tokens=False to the second tokenizer call, since the rendered chat template already starts with the BOS token:

inputs_id = tokenizer(inputs, add_special_tokens=False)
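
If the same code path has to handle both chat-template output and plain strings, one option is to decide the flag from the text itself. This is only a sketch: should_add_special_tokens is a hypothetical helper, and it assumes BOS is the only special token the tokenizer would add on its own, which I believe holds for this model but have not verified for others.

```python
def should_add_special_tokens(text: str, bos_token: str = "<|begin_of_text|>") -> bool:
    """Return False when `text` already begins with the BOS token string.

    Assumption: BOS is the only special token the tokenizer would add itself.
    """
    return not text.startswith(bos_token)


# A string shaped like the rendered chat template from this thread.
rendered = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|>"

print(should_add_special_tokens(rendered))   # False: BOS is already in the text
print(should_add_special_tokens("Hello!"))   # True: let the tokenizer add BOS
```

With a loaded tokenizer this would be used as tokenizer(inputs, add_special_tokens=should_add_special_tokens(inputs, tokenizer.bos_token)).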
