wrong bos_token
#15
by loubnabnl - opened
The bos_token in special_tokens_map.json seems to be </s> when it should be <s>, since that's the token being added at the start of each sequence (id 1):
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardCoder-Python-34B-V1.0")
>>> tokenizer.bos_token
'</s>'
>>> tokenizer.bos_token_id
2
>>> tokenizer("</s>")
{'input_ids': [1, 2], 'attention_mask': [1, 1]}
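Decoding the two ids makes the mismatch explicit: id 1, the token actually prepended, is <s>, while the configured bos_token_id 2 is </s>:

>>> tokenizer.convert_ids_to_tokens([1, 2])
['<s>', '</s>']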
During training, we set bos_token and eos_token to be the same token, </s>.
I see, but in that case shouldn't the bos_token in tokenizer_config be changed to </s> for consistency, so that it's the token that actually gets added at the beginning?
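For illustration, a minimal sketch of what that override might look like at load time, assuming the slow (sentencepiece) tokenizer, which derives the prepended BOS from bos_token_id; the fast tokenizer bakes the BOS into its post-processor, so it may behave differently:

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained(
...     "WizardLM/WizardCoder-Python-34B-V1.0",
...     bos_token="</s>",  # override the special token at load time
...     use_fast=False,    # slow tokenizer: prepended BOS follows bos_token_id
... )
>>> tok.bos_token_id       # expected: 2
>>> tok("hi")["input_ids"] # expected to start with 2 instead of 1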
I have done an extra experiment changing the bos_token in the tokenizer_config to </s> on our WizardLM/WizardCoder-Python-7B-V1.0. The HumanEval score decreases from 55.5 to 51.2. Therefore, we cannot modify the bos_token in the tokenizer_config.
WizardLM changed discussion status to closed