Broken output when context is larger than 4k

#3 by Bakanayatsu

Adjusting the parameter 'llama.context_length' in the GGUF from 8192 to 4096 resolves the issue.

llama.cpp doesn't have sliding window attention support, so the model effectively only has a 4096 context length.
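If you want to check what a GGUF actually claims before patching it, the gguf Python package (pip install gguf, the same library the llama.cpp scripts build on) can read the field. A minimal sketch, with the path as a placeholder:

from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")
field = reader.fields["llama.context_length"]
# scalar fields store a single element; prints 8192 before the fix, 4096 after
print(field.parts[field.data[0]][0])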

Automatic RoPE scaling with --contextsize 8192 in koboldcpp:

Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 8272
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1

When it should be:

Automatic RoPE Scaling: Using (scale:1.000, base:32000.0).
llama_new_context_with_model: n_ctx      = 8272
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1

For the fix, use gguf-set-metadata.py from llama.cpp's gguf-py scripts:
https://github.com/ggerganov/llama.cpp/tree/master/gguf-py/scripts
python gguf-set-metadata.py path/to/model.gguf llama.context_length 4096
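To confirm the change took, you can dump the metadata again with gguf-dump.py from the same scripts folder and check the llama.context_length line:

python gguf-dump.py path/to/model.gguf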

This actually affects my imatrix quants too. I'll reupload them.

Owner

Could this be caused by using the wrong tokenizer when quantizing, or just mismatched model types?
Does this render my GGUFs broken? If so, I may just direct people to your imatrix versions and take this repo down, as I'm away from my PC for a few days.

This results in backends that do not support Mistral's sliding window attention (such as llama.cpp/koboldcpp) being unable to handle 8192 context correctly. With --contextsize 8192, koboldcpp determines that the model natively supports 8192 (when it can only do 4096, since SWA is unsupported), so it decides it does not need to "extend" the max context. As a result, any context greater than 4096 is incoherent.
(screenshot attached)
llama.context_length should be changed to 4096
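To make the failure mode concrete, here is a rough sketch of the decision a loader makes (illustrative pseudologic only, not koboldcpp's actual code; the function name is made up):

def needs_rope_extension(requested_ctx: int, trained_ctx: int) -> bool:
    # trained_ctx is read from llama.context_length in the GGUF metadata
    return requested_ctx > trained_ctx

# with the wrong metadata (8192): 8192 > 8192 is False, so no RoPE scaling
assert not needs_rope_extension(8192, 8192)
# with the corrected metadata (4096): scaling kicks in as it should
assert needs_rope_extension(8192, 4096)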

Edit: I might be wrong in assuming that SWA is used when training/finetuning SOLAR, or that it is used at greater than 4096 context, since SOLAR is retrained with a 4096 max context only.

Owner

(screenshot: koboldcpp benchmark output)
I haven't used it before, but koboldcpp's benchmark says it stays coherent to 8k ctx using Q3_K_M.

Owner

Also, I just saw this in cmd:
(screenshot: console output)
Would this mean it is automatically correcting the incorrect config?

> (screenshot: koboldcpp benchmark output)
> I haven't used it before, but koboldcpp's benchmark says it stays coherent to 8k ctx using Q3_K_M.

Are you sure you are running --contextsize 8192 instead of --contextsize 16384?

Owner

The first test was run with 8k; I ran another with 16k, and that's when I noticed it was changing the config.
The line under the file name in the screenshot says
MaxCtx: 8192

If you had first benchmarked with --contextsize 8192, it should output Coherent: False.

Owner

I've uploaded the resulting CSV files to the repo; I believe they're correct.
The commands used were:
koboldcpp.exe --contextsize 8192 --benchmark results.csv --model C:\Users\sai\Downloads\Fimbulvetr-Kuro-Lotus-10.7B-Q3_K_M.gguf
koboldcpp.exe --contextsize 16384 --benchmark results.csv --model C:\Users\sai\Downloads\Fimbulvetr-Kuro-Lotus-10.7B-Q3_K_M.gguf

For some reason, koboldcpp's benchmark outputs Coherent: True even though the outputs are gibberish. Try testing with real examples: a prompt with greater than 4096 tokens of context.
(two screenshots of gibberish output attached)
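If you want to reproduce this yourself, here is a minimal sketch that sends a roughly 5k-token multi-turn-style prompt to koboldcpp's KoboldAI-compatible API (assuming the default port 5001; request/response field names follow the KoboldAI generate endpoint and may differ across versions):

import requests

# build a multi-turn-style prompt that lands comfortably past 4096 tokens
turns = [f"User: question number {i}\nBot: unique reply number {i}\n" for i in range(400)]
prompt = "".join(turns) + "User: summarize our conversation.\nBot:"

resp = requests.post("http://localhost:5001/api/v1/generate", json={
    "prompt": prompt,
    "max_context_length": 8192,
    "max_length": 120,
})
print(resp.json()["results"][0]["text"])  # gibberish here reproduces the bug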

Owner

Can confirm it is broken: when using multi-turn conversation it goes insane past 4k, but when feeding in one large chunk of context larger than 4k it remains coherent.
It's cursed.

I was initially confused too as to why the same prompt repeated ((1000 tokens) * 5 = 5k) would produce coherent text, but multi-turn with unique replies would break past 4k.

Owner

Thank you for all the effort; I wouldn't have had a clue where to start.
