Error in tokenizer_config.json?

#1 by m9e - opened

Sorry if this is inane, but tokenizer_config.json specifies an eos_token of "< | end_of_sentence | >", while config.json has eos_token_id 32021, which is "<|EOT|>". This means that if the software you are using (in this case, text-generation-webui) reads the eos_token from the tokenizer config rather than from config.json, generation never stops and garbage fills the rest of the max-token budget. (I found that adding the "<|EOT|>" string to the custom stopping strings works fine in the UI, and I've been on a little journey to figure out which, if any, of the OpenAI-compatible parameters can replicate that; it may be none. I'm also not sure whether there's a reason the tokenizer config is different or whether it ended up as a side effect.) In any case, I notice the current full-weight original model appears to use the "<|EOT|>" form:

https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/tokenizer_config.json
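For reference, a quick way to see the mismatch (and to paper over it client-side until the configs agree) might look like the sketch below. It only assumes `transformers` is installed and uses the repo id linked above; the override at the end is just my assumption that the token mapped to by config.json's eos_token_id is the one the model was actually trained to emit.

```python
# Minimal sketch: compare the eos_token declared in tokenizer_config.json with the
# token that config.json's eos_token_id actually maps to, then point the tokenizer
# at the latter if they disagree.
from transformers import AutoConfig, AutoTokenizer

repo = "deepseek-ai/deepseek-coder-33b-instruct"
tok = AutoTokenizer.from_pretrained(repo)
cfg = AutoConfig.from_pretrained(repo)

print("tokenizer_config eos_token:", tok.eos_token)
print("config.json eos_token_id  :", cfg.eos_token_id,
      "->", tok.convert_ids_to_tokens(cfg.eos_token_id))

# Assumption: the id from config.json is the intended stop token. Overriding the
# tokenizer's eos_token makes generation stop instead of running to max_new_tokens.
if tok.eos_token_id != cfg.eos_token_id:
    tok.eos_token = tok.convert_ids_to_tokens(cfg.eos_token_id)
```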

Thanks, I just need to update the configs to match the revised base.
