wrong bos_token

#15

by loubnabnl - opened Aug 29, 2023

Aug 29, 2023

•

edited Aug 29, 2023

The bos_token in special_tokens_map.json seems to be </s> when it should be <s> since that's the token being added at start of each sequence (id 1)

tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardCoder-Python-34B-V1.0")
>>> tokenizer.bos_token
'</s>'
>>> tokenizer.bos_token_id
2
>>> tokenizer("</s>")
{'input_ids': [1, 2], 'attention_mask': [1, 1]}

Ziyang

WizardLM Team org Aug 29, 2023

During training, we set bos_token and eos_token to be the same </s>.

loubnabnl

Aug 29, 2023

I see, but in this case shouldn't the bos_token be changed in tokenizer_config to be </s> for consistency so it's the token that gets added at the beginning?

Ziyang

WizardLM Team org Aug 30, 2023

I have done an extra experiment to change the bos_token in the tokenizer_config to </s> on our WizardLM/WizardCoder-Python-7B-V1.0. The humaneval score decreases from 55.5 to 51.2. Therefore, we cannot modify the bos_token in the tokenizer_config.

WizardLM changed discussion status to closed Sep 14, 2023

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment