Disable normalization for special tokens

#26
by lewtun - opened

This PR fixes an issue in the Mistral tokenizer where special tokens aren't tokenized correctly when concatenated with other characters, e.g.:

from transformers import AutoTokenizer 

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Gives correct IDs: {'input_ids': [2], 'attention_mask': [1]}
tokenizer("</s>", add_special_tokens=False) 

# Gives correct IDs: {'input_ids': [842], 'attention_mask': [1]}
tokenizer(".", add_special_tokens=False)

# Gives incorrect IDs: {'input_ids': [842, 700, 28713, 28767], 'attention_mask': [1, 1, 1, 1]}
tokenizer(".</s>", add_special_tokens=False)

The solution is to disable normalization for all the special tokens, which is what this PR does. Note that until this PR is merged, the following workaround with the slow tokenizer can be adopted:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", from_slow=True)
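
As a quick sanity check of the workaround (continuing from the snippet above; the expected IDs are inferred from the single-token examples earlier, not quoted output), the problematic input should now map the trailing EOS token to a single ID:

# With normalization disabled on the special tokens, "</s>" is matched verbatim
tokenizer(".</s>", add_special_tokens=False)
# Expected: {'input_ids': [842, 2], 'attention_mask': [1, 1]}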
lerela changed pull request status to merged
