Special tokens inconsistency | Applying chat template can lead to doubled `<BOS_TOKEN>`
Inconsistent special tokens
Why are the following tokens not registered as special tokens in `tokenizer_config.json`:

- `<|START_OF_TURN_TOKEN|>`
- `<|USER_TOKEN|>`
- `<|CHATBOT_TOKEN|>`

while the following one is registered as a special token:

- `<|END_OF_TURN_TOKEN|>`

This causes problems when running `tokenizer.batch_decode(contents, skip_special_tokens=True)`, since `<|START_OF_TURN_TOKEN|>`, `<|USER_TOKEN|>`, and `<|CHATBOT_TOKEN|>` then need to be stripped from the output manually (see step 3) in the code example below).
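For reference, here is a minimal sketch (assuming the tokenizer is loaded the same way as in the example further down) that prints which of these tokens the tokenizer actually reports as special:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")

# Check which of the turn tokens are registered as special tokens
for token in ["<|START_OF_TURN_TOKEN|>", "<|USER_TOKEN|>", "<|CHATBOT_TOKEN|>", "<|END_OF_TURN_TOKEN|>"]:
    print(token, token in tokenizer.all_special_tokens)
```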
Doubled `<BOS_TOKEN>`
The `{{ bos_token }}` that is part of the chat template in `tokenizer_config.json` causes a doubled `<BOS_TOKEN>` when the prompt is processed in two steps:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "This is a sample!"}]

# 1) '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>This is a sample!<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 2) ['<BOS_TOKEN>', '<BOS_TOKEN>', '<|START_OF_TURN_TOKEN|>', '<|USER_TOKEN|>', 'This', ' is', ' a', ' sample', '!', '<|END_OF_TURN_TOKEN|>', '<|START_OF_TURN_TOKEN|>', '<|CHATBOT_TOKEN|>']
tokenized_ids = tokenizer(prompt, max_length=100, padding=True, truncation=True, add_special_tokens=True).input_ids
tokenizer.batch_decode(tokenized_ids)

# 3) ['', '', '<|START_OF_TURN_TOKEN|>', '<|USER_TOKEN|>', 'This', ' is', ' a', ' sample', '!', '', '<|START_OF_TURN_TOKEN|>', '<|CHATBOT_TOKEN|>']
tokenizer.batch_decode(tokenized_ids, skip_special_tokens=True)
```
I am aware that for 2) we could set `add_special_tokens=False` to avoid this issue, but since most other models (and pre-trained tokenizers) add the `<BOS_TOKEN>` in step 2), it doesn't make sense to also include it in the chat template. For reusing similar code across models, this just adds extra complexity to handle the special case for this model.