wrong bos_token

#15
by loubnabnl HF staff - opened

The bos_token in special_tokens_map.json seems to be </s> when it should be <s> since that's the token being added at start of each sequence (id 1)

tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardCoder-Python-34B-V1.0")
>>> tokenizer.bos_token
'</s>'
>>> tokenizer.bos_token_id
2
>>> tokenizer("</s>")
{'input_ids': [1, 2], 'attention_mask': [1, 1]}
WizardLM Team org

During training, we set bos_token and eos_token to be the same </s>.

I see, but in this case shouldn't the bos_token be changed in tokenizer_config to be </s> for consistency so it's the token that gets added at the beginning?

WizardLM Team org

I have done an extra experiment to change the bos_token in the tokenizer_config to </s> on our WizardLM/WizardCoder-Python-7B-V1.0. The humaneval score decreases from 55.5 to 51.2. Therefore, we cannot modify the bos_token in the tokenizer_config.

WizardLM changed discussion status to closed

Sign up or log in to comment