Special tokens inconsistency | Applying chat template can lead to doubled `<BOS_TOKEN>`

#9
by robinschmidt

Inconsistent special tokens

Why are the following tokens not registered as special tokens in tokenizer_config.json:

  • <|START_OF_TURN_TOKEN|>
  • <|USER_TOKEN|>
  • <|CHATBOT_TOKEN|>

But the following token is registered as a special token:

  • <|END_OF_TURN_TOKEN|>

This causes problems when running tokenizer.batch_decode(contents, skip_special_tokens=True), since <|START_OF_TURN_TOKEN|>, <|USER_TOKEN|>, and <|CHATBOT_TOKEN|> have to be stripped from the output manually (see code example 3) below).
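A possible local workaround (not an upstream fix, and assuming the three tokens are already in the vocabulary and only lack the special flag) is to register them as additional special tokens on the consumer side, so that skip_special_tokens=True drops them during decoding as well. A minimal sketch:

from transformers import AutoTokenizer

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inspect which tokens the tokenizer currently treats as special.
print(tokenizer.all_special_tokens)

# Mark the turn/role tokens as additional special tokens; since they already
# exist in the vocabulary, no new ids should be created, but they will now be
# removed by skip_special_tokens=True when decoding.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|START_OF_TURN_TOKEN|>", "<|USER_TOKEN|>", "<|CHATBOT_TOKEN|>"]}
)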

Doubled <BOS_TOKEN>

Because {{ bos_token }} is part of the chat template in tokenizer_config.json, the <BOS_TOKEN> gets doubled when prompt construction and tokenization are run as two separate steps:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "This is a sample!"}]

# 1) '<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>This is a sample!<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 2)  ['<BOS_TOKEN>', '<BOS_TOKEN>', '<|START_OF_TURN_TOKEN|>', '<|USER_TOKEN|>', 'This', ' is', ' a', ' sample', '!', '<|END_OF_TURN_TOKEN|>', '<|START_OF_TURN_TOKEN|>', '<|CHATBOT_TOKEN|>'] 
tokenized_ids = tokenizer(prompt, max_length=100, padding=True, truncation=True, add_special_tokens=True).input_ids
tokenizer.batch_decode(tokenized_ids)

# 3) ['', '', '<|START_OF_TURN_TOKEN|>', '<|USER_TOKEN|>', 'This', ' is', ' a', ' sample', '!', '', '<|START_OF_TURN_TOKEN|>', '<|CHATBOT_TOKEN|>']
tokenizer.batch_decode(tokenized_ids, skip_special_tokens=True)

I am aware that we could set add_special_tokens=False in step 2) to avoid this issue, but since most other models (and pre-trained models) add the <BOS_TOKEN> in step 2), it doesn't make sense to also include it in the chat template. For code that is reused across models, this just adds extra complexity to handle this model as a special case.
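For reference, here is a minimal sketch of the two workarounds mentioned above: disabling the tokenizer's own special-token handling in step 2), or letting apply_chat_template tokenize directly so there is only one step. The assertions are purely illustrative.

from transformers import AutoTokenizer

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "This is a sample!"}]

# Workaround A: keep the two-step flow, but skip the tokenizer's own special
# tokens so that only the chat template inserts <BOS_TOKEN>.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids_a = tokenizer(prompt, add_special_tokens=False).input_ids

# Workaround B: tokenize directly from the chat template in a single step.
ids_b = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)

# Both should contain exactly one <BOS_TOKEN>.
assert ids_a.count(tokenizer.bos_token_id) == 1
assert ids_b.count(tokenizer.bos_token_id) == 1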
