Wrong Special Token

#1
by SupercarryNg - opened

I think the eos_token in tokenizer_config.json should be "<|EOT|>" instead of "<|end▁of▁sentence|>":
"eos_token": {
"__type": "AddedToken",
"content": "<|end▁of▁sentence|>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
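
For reference, a minimal sketch to check which eos token the tokenizer actually reports (assuming the deepseek-ai/deepseek-moe-16b-base checkpoint this repo ships):

from transformers import AutoTokenizer

# Inspect the eos token configured in the shipped tokenizer_config.json
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-base", trust_remote_code=True)
print(tokenizer.eos_token)     # configured eos string
print(tokenizer.eos_token_id)  # its id in the vocabulary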

DeepSeek org

eos_token is a special token; it cannot be obtained by encoding the string directly:


from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-base", trust_remote_code=True)
print(tokenizer.encode('<|end▁of▁sentence|>'))
# [100000, 27, 91, 409, 11009, 210, 994, 11009, 210, 53582, 66325]
print(tokenizer.decode([100001]))
# <|end▁of▁sentence|>

If you want to add an eos_token at the end of input_ids, you need to load the tokenizer with add_eos_token=True:

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-base", trust_remote_code=True, add_eos_token=True)
print(tokenizer.encode('<|end▁of▁sentence|>'))
# [100000, 27, 91, 409, 11009, 210, 994, 11009, 210, 53582, 66325, 100001]
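
Another option, as a minimal sketch (same checkpoint as above): keep the default loading and append tokenizer.eos_token_id to the encoded ids yourself.

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-base", trust_remote_code=True)
# Append the special eos id manually instead of relying on add_eos_token=True
input_ids = tokenizer.encode("your prompt here") + [tokenizer.eos_token_id]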
