Tokenizer issues?

#3
by xhyi - opened

I am running it via the llama.cpp server with the ChatML format, and I see that the model still outputs "<|im_end|>" as raw text.
Looking at the model info printed at startup:

llm_load_print_meta: general.name     = D:\LLM_MODELS\ShinojiResearch
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'

I am not sure how to correct this. Do you know?
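
In the meantime I am working around it on the client side by sending a stop string with each request so the raw marker gets trimmed. A minimal sketch, assuming the default llama.cpp server endpoint on 127.0.0.1:8080 (the host, port, and prompt here are just placeholders):

```python
import requests

# Ask the llama.cpp server to stop generation when the raw ChatML end
# marker shows up, since the model emits it as plain text.
payload = {
    "prompt": "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n",
    "n_predict": 256,
    "temperature": 0.7,
    # Treat the literal ChatML end token (and the broken variant I keep
    # seeing) as stop strings so they never reach the client output.
    "stop": ["<|im_end|>", "<im_end|>"],
}

resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=120)
print(resp.json()["content"])
```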

These match the original repo at https://huggingface.co/ShinojiResearch/Senku-70B-Full/blob/main/tokenizer_config.json.
Are you seeing some kind of issue with special tokens?
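
If it helps narrow things down, one quick check is whether the HF tokenizer from the original repo even knows about the ChatML tokens. A rough sketch with transformers, assuming the public ShinojiResearch/Senku-70B-Full repo (if <|im_end|> is not registered as an added/special token, it will be split into several pieces):

```python
from transformers import AutoTokenizer

# Load the tokenizer from the original (non-quantized) repo.
tok = AutoTokenizer.from_pretrained("ShinojiResearch/Senku-70B-Full")

# If "<|im_end|>" is a registered special token it maps to a single id;
# otherwise convert_tokens_to_ids returns the unk id and encode() splits it.
print("special tokens map:", tok.special_tokens_map)
print("id of <|im_end|>  :", tok.convert_tokens_to_ids("<|im_end|>"))
print("encoded pieces    :", tok.encode("<|im_end|>", add_special_tokens=False))
```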

I did some further investigation, and it doesn't look like it's the quant's fault. The model in general just seems to have issues following ChatML, outputting things like <im_end|> (a broken ChatML end token) after user messages, and it happens very frequently.
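
A similar check can be run directly against the GGUF. Here is a rough sketch using the llama-cpp-python bindings (the model path is a placeholder, not the actual quant filename): if the GGUF carried <|im_end|> as a proper special token, tokenize() with special=True would return a single id.

```python
from llama_cpp import Llama

# Load the quantized GGUF (path is a placeholder for wherever the quant lives).
llm = Llama(model_path="senku-70b.Q4_K_M.gguf", n_ctx=512, verbose=False)

# Tokenize the ChatML end marker. With special=True a registered special token
# comes back as one id; several ids means it is treated as plain text.
ids = llm.tokenize(b"<|im_end|>", add_bos=False, special=True)
print("token ids:", ids)
print("detok    :", llm.detokenize(ids))
```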

I noticed that this post from someone else who tested it describes the exact same issue.

It is probably due to Miqu originally being a Mistral-prompted model that was then finetuned into a ChatML-prompted model (Senku). I will probably report this to the original author.
