Weird tokens

#3
by dranger003 - opened

Thanks for converting this model! However, I see some weird tokens when running llama.cpp (built from master):

llm_load_print_meta: general.name   = deepseek-ai_deepseek-coder-33b-instruct
llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token  = 30 '?'

Yeah it is weird, but that's exactly what's defined in tokenizer_config.json:

 [pytorch2] tomj@MC:/workspace/git/gguf-llama (master ✘)✭ ᐅ grep -A5 -i eos /workspace/process/deepseek-ai_deepseek-coder-33b-instruct/source/tokenizer_config.json | cat -ve
  "add_eos_token": false,$
  "bos_token": {$
    "__type": "AddedToken",$
    "content": "<M-oM-=M-^\beginM-bM-^VM-^AofM-bM-^VM-^AsentenceM-oM-=M-^\>",$
    "lstrip": false,$
    "normalized": true,$
--$
  "eos_token": {$
    "__type": "AddedToken",$
    "content": "<M-oM-=M-^\endM-bM-^VM-^AofM-bM-^VM-^AsentenceM-oM-=M-^\>",$
    "lstrip": false,$
    "normalized": true,$
    "rstrip": false,$
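For reference, the `M-...` sequences above are `cat -v` escapes for non-ASCII bytes. Decoding them (a quick sketch of mine, assuming UTF-8) shows the actual characters DeepSeek used: a fullwidth vertical line instead of an ASCII pipe, plus the `▁` block that SentencePiece-style tokenizers use for word boundaries.

```python
import unicodedata

# cat -v notation: M-x means the byte 0x80 | x, and M-^X means 0x80 | (X - 0x40).
# So:  M-oM-=M-^\   -> bytes EF BD 9C
#      M-bM-^VM-^A  -> bytes E2 96 81
for raw in (bytes([0xEF, 0xBD, 0x9C]), bytes([0xE2, 0x96, 0x81])):
    ch = raw.decode("utf-8")
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {ch!r}")
# U+FF5C FULLWIDTH VERTICAL LINE '｜'
# U+2581 LOWER ONE EIGHTH BLOCK '▁'
```

So the BOS token is literally `<｜begin▁of▁sentence｜>`, with U+FF5C rather than `|`.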

I don't know why they've used those weird chars, but this isn't a llama.cpp issue; it's using the tokens as defined by the original model.

FYI I'm just about to re-make all the GGUFs after an update to the convert.py I'm using, which affects special tokens. It won't change this, but might affect other aspects of special token usage.

That weird combination of characters is probably to reduce the odds of them being present in random input.
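A quick illustration of that point (my own sketch, not from the model card): an ASCII-pipe spelling can show up verbatim in ordinary text or code, while the fullwidth/U+2581 spelling essentially cannot.

```python
# The real special token, using U+FF5C and U+2581 as decoded above.
special = "<\uff5cend\u2581of\u2581sentence\uff5c>"
# A plain-ASCII lookalike a user could plausibly type.
ascii_lookalike = "<|end of sentence|>"

# Hypothetical user input that happens to contain the ASCII spelling.
sample = 'the user wrote "<|end of sentence|>" inside a code comment'

print(ascii_lookalike in sample)  # True: ASCII spelling collides easily
print(special in sample)          # False: the fullwidth variant does not
```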

The garbled characters in dranger003's output (e.g. the `?` shown for the LF token) are just a console character-set issue, not a problem with the GGUF.
