The model produces nonsensical/repeating output (GGUF)

#13
by nullt3r - opened

First, thanks for the model!

I am having issues with the officially linked GGUF models (Q8): the model either keeps generating content continuously or sometimes stops right after "The". The first message always seems to be OK.

I am using LM Studio 0.2.21 with the default Llama 3 template and parameters (only the context size is changed, set to 100k).

nullt3r changed discussion title from The model produces nonsensical/repeating output to The model produces nonsensical/repeating output (GGUF)

True, but in my case the output is not nonsensical: the answers make sense, but the model repeats the same content. My laptop has 16 GB of RAM and no GPU; the Qwen coder chat model is consistent by comparison. Still, I'm happy to see long answers, even if repetitive.

Make sure to correctly include the End Of Stream (EOS) token in your prompt (as the above post says; it's in German, I think).
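
A minimal sketch of that idea using llama-cpp-python (the model path, context size, and token limit here are placeholders, not the exact LM Studio setup): generation is stopped explicitly on Llama 3's end-of-turn token <|eot_id|> (ID 128009) instead of relying on the EOS token stored in the broken GGUF metadata (<|end_of_text|>, ID 128001).

```python
# Sketch only: stop Llama 3 generation on its end-of-turn token.
# Assumes llama-cpp-python is installed; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct-Q8_0.gguf", n_ctx=8192)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "hello"}],
    stop=["<|eot_id|>"],  # explicit stop string as a workaround for the bad EOS metadata
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])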

For llama.cpp the solution was to change the EOS token:

"A look at the log file then shows that Llama-3 uses 128009 as the EOS token ID. However, 128001 is entered in the GGUF file, so this can't work. Luckily, llama.cpp has a small script that allows you to change the EOS token ID. The following call changes the EOS token ID of the Meta-Llama-3-70B-Instruct-Q4_K_M.gguf file to 128009:

python llama.cpp/gguf-py/scripts/gguf-set-metadata.py gguf/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf tokenizer.ggml.eos_token_id 128009 --force"

Yes, I did do that. Unfortunately, once you insert long text, the model breaks: it stops following the formatting rules (special tokens) and generates continuous output.

It also generates weird responses, for example:

user: hello
bot: Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat? I'm happy to hear from you but I'll need a moment to check in with myself before our conversation. I just had another request for help that I need to respond to. Thank you very much for waiting.

Has anything changed?
