Weird tokens

#3
by dranger003 - opened

Thanks for converting this model! However, I see some weird tokens when running llama.cpp (built from master):

llm_load_print_meta: general.name   = deepseek-ai_deepseek-coder-33b-instruct
llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token  = 30 '?'

Yeah it is weird, but that's exactly what's defined in tokenizer_config.json:

 [pytorch2] tomj@MC:/workspace/git/gguf-llama (master ✘)✭ ᐅ grep -A5 -i eos /workspace/process/deepseek-ai_deepseek-coder-33b-instruct/source/tokenizer_config.json | cat -ve
  "add_eos_token": false,$
  "bos_token": {$
    "__type": "AddedToken",$
    "content": "<M-oM-=M-^\beginM-bM-^VM-^AofM-bM-^VM-^AsentenceM-oM-=M-^\>",$
    "lstrip": false,$
    "normalized": true,$
--$
  "eos_token": {$
    "__type": "AddedToken",$
    "content": "<M-oM-=M-^\endM-bM-^VM-^AofM-bM-^VM-^AsentenceM-oM-=M-^\>",$
    "lstrip": false,$
    "normalized": true,$
    "rstrip": false,$
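For reference, the `M-...` sequences above are `cat -v` escapes for non-ASCII bytes. Decoding them (a quick sketch of mine, assuming UTF-8) shows the actual characters DeepSeek used: a fullwidth vertical line instead of an ASCII pipe, plus the `▁` block that SentencePiece-style tokenizers use for word boundaries.

```python
import unicodedata

# cat -v notation: M-x means the byte 0x80 | x, and M-^X means 0x80 | (X - 0x40).
# So:  M-oM-=M-^\   -> bytes EF BD 9C
#      M-bM-^VM-^A  -> bytes E2 96 81
for raw in (bytes([0xEF, 0xBD, 0x9C]), bytes([0xE2, 0x96, 0x81])):
    ch = raw.decode("utf-8")
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {ch!r}")
# U+FF5C FULLWIDTH VERTICAL LINE '｜'
# U+2581 LOWER ONE EIGHTH BLOCK '▁'
```

So the BOS token is literally `<｜begin▁of▁sentence｜>`, with U+FF5C rather than `|`.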

I don't know why they've used those weird chars, but this isn't a llama.cpp issue; it's using the tokens as defined by the original model.

FYI I'm just about to re-make all the GGUFs after an update to the convert.py I'm using, which affects special tokens. It won't change this, but might affect other aspects of special token usage.

That weird combination of characters is probably to reduce the odds of them being present in random input.
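A quick illustration of that point (my own sketch, not from the model card): an ASCII-pipe spelling can show up verbatim in ordinary text or code, while the fullwidth/U+2581 spelling essentially cannot.

```python
# The real special token, using U+FF5C and U+2581 as decoded above.
special = "<\uff5cend\u2581of\u2581sentence\uff5c>"
# A plain-ASCII lookalike a user could plausibly type.
ascii_lookalike = "<|end of sentence|>"

# Hypothetical user input that happens to contain the ASCII spelling.
sample = 'the user wrote "<|end of sentence|>" inside a code comment'

print(ascii_lookalike in sample)  # True: ASCII spelling collides easily
print(special in sample)          # False: the fullwidth variant does not
```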

The garbled characters in dranger003's output (e.g. the `?` shown for the LF token) are just a console character-set issue, not a problem with the GGUF.
