[llama.cpp PR#7527] GGUF Quantized KV Support

#15
opened by Lewdiculous (LWDCLS Research org)

Related PRs:
https://github.com/ggerganov/llama.cpp/pull/7527
https://github.com/ggerganov/llama.cpp/pull/7681
https://github.com/ggerganov/llama.cpp/pull/7412

Available in the KoboldCpp builds from Nexesenex:
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.67b_b3066

The legend as always, providing it for the thirsty early adopters.

This is actually so huge, honestly I can almost double my --contextsize now. It's a straight up +50% boost at least.
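(As a rough aside on where the gain comes from: a q8 cache takes roughly half the memory of the default f16 cache, and a q4 cache roughly a quarter, so the VRAM the cache was occupying can hold about 2x or 4x as many tokens. The model weights don't shrink at all, which is why the overall context gain lands somewhere between +50% and a full doubling rather than scaling with the cache savings alone.)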

Lewdiculous changed discussion title from [llama.cpp PR#7527] Quantized KV Support to [llama.cpp PR#7527] GGUF Quantized KV Support
LWDCLS Research org

This thread is for discussions, testing, sharing results, questions, issues, coping, dreams... Anything goes.

LWDCLS Research org

For me, right now, as soon as the context is full and Context Shifting triggers, it crashes.

[Context Shifting: Erased 140 tokens at position 1636]GGML_ASSERT: U:\GitHub\kobold.cpp\ggml-cuda\rope.cu:255: src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16
<CRASH>
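(My reading of that assert, not an official explanation: Context Shifting rotates the surviving part of the K cache in place with RoPE, and the CUDA rope kernel only accepts F32/F16 tensors, so a quantized K cache trips the check.)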

But very promising at this stage of the implementation.

Only the 3063 q8_0 build has passed all tests flawlessly.

This one will be the most stable and may fix the crash issue:
https://github.com/Nexesenex/kobold.cpp/releases/download/v1.67b_b3066/koboldcpp_cuda_12.2_K8_V51.exe

LWDCLS Research org

This is actually the one I was already using to test. Have you successfully Context Shifted? I tested with --contextsize 6144 in an existing conversation that was about 8K tokens long.

CtxLimit: 58/8192, Process:0.43s (10.2ms/T = 97.67T/s), Generate:3.01s (188.2ms/T = 5.31T/s), Total:3.44s (4.65T/s)
CtxLimit: 8192/8192, Process:34.21s (4.4ms/T = 227.17T/s), Generate:185.17s (440.9ms/T = 2.27T/s), Total:219.37s (1.91T/s)
[Context Shifting: Erased 420 tokens at position 2]GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false
GGML_ASSERT: ......\ggml.c:14700: false

Yeah - same for me, getting this whenever it shifts.
It's a more than welcome update in any case, and I'm sure they'll fix this soon.

[Context Shifting: Erased 165 tokens at position 1686]
Processing Prompt (24 / 24 tokens)GGML_ASSERT: U:\GitHub\kobold.cpp\ggml-cuda\rope.cu:255: src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16

[process exited with code 3221226505 (0xc0000409)]

All I can find for the error code is memory corruption.
I tried K8 V5_1 and KV5_1 with no luck, and just about every setting in the GUI.
Also tried the old Smart Context; it dies too.

Just to experiment, this is Llama-3-8B-Q6_K @ 16K (almost) with a LLaVA mmproj loaded:
[screenshots]

Yi-9B-32K-Q5_K_M @ 32K

[screenshot]

LWDCLS Research org

Yi really is super skinny, huh?

Yi's context is tiny, it's magic 😭
I've been waiting for a Yi-9B RP model for a while; it's really smart and has better reasoning than most models I've tried in instruct.
Plus it has a native 16K chat version, something that could actually be useful if KV quantization becomes stable enough.

Also, with some testing, there is about a 1/3 reduction in generation speed with KV quantization:
35 T/s -> 25 T/s
Can't really tell the difference though, I can't read that fast.

Upcoming llama.cpp fork has a multi-threaded 👀 update - read a different word with each eye at the same time for x2 token reading speed

That would be super helpful for higher context while using KV quanting; past 32K context it takes forever to read the tokens.
With KV quant, 0 -> 14K ctx averages 600 T/s ingestion.
Without KV quant, 0 -> 14K averages 1100 T/s ingestion.
I imagine the difference would be quite noticeable when ingesting 32-64K ctx.
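(Naively extrapolating those averages to a full 64K prompt as a back-of-envelope check: 65536 / 600 ≈ 109 s with the quantized cache versus 65536 / 1100 ≈ 60 s without, and ingestion usually slows further as context grows, so the absolute gap would only widen.)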

LWDCLS Research org

KoboldCpp 1.67:

You can now utilize the Quantized KV Cache feature in KoboldCpp with --quantkv [level], where level 0=f16, 1=q8, 2=q4. Note that quantized KV cache is only available if --flashattention is used, and is NOT compatible with Context Shifting, which will be disabled if --quantkv is used.
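For reference, a launch line combining those flags might look like the sketch below; the flag names come from the note above, while the binary name, model file, and context size are placeholders rather than anything recommended here:

koboldcpp.exe --model your-model.gguf --contextsize 16384 --flashattention --quantkv 1

Here --quantkv 1 selects the q8 cache, and per the 1.67 notes it only takes effect alongside --flashattention (and disables Context Shifting).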

Context Shifting please come home...

Quantized KV cache + Qwen2's context size sorcery is big.
Qwen2-7B-Q5_K_M @ 64K ctx & 8-bit cache:
[screenshot]
I'd be concerned if you needed context shifting + 64K ctx 😭
Edit - Spelling
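A rough back-of-envelope for why that fits, assuming Qwen2-7B's published config of 28 layers, 4 KV heads (GQA), and head dim 128 (my assumption, not something stated in this thread): the cache stores K and V for every layer, so each token costs 2 x 28 x 4 x 128 = 28,672 values, roughly 56 KB at f16 or a bit over half that at q8. At 64K context that is about 3.5 GB for an f16 cache versus roughly 1.9 GB quantized, which is the difference between blowing past 8 GB next to the ~5 GB of Q5_K_M weights and just about squeezing in.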

LWDCLS Research org

I'd be concerned if you needed context shifting + 64K ctx 😭

LMAO at that point... Yeah, your ERP has gone too far xD

Honestly that's crazy! Qwen2 Q5 at 64K in only 8GB of VRAM?!!

Are there any prominent RP tunes/merges or are you using the original?

I tried the dolphin version, it's uh, interesting?

[screenshots]

I'm excited to see Qwen2; however, we really need an uncensored RP tune, as it's severely censored, as expected (coming from China). But CodeQwen1.5-7B kills it for code, so those guys know how to make efficient models. Qwen censorship in general is comical, though :D

It refused to give me safety tips for masturbation because "The request you're making involves activities that can be harmful and potentially illegal. Safety, legality, and ethical considerations are important factors that I must adhere to when providing assistance."

However, they note on the model page that it's good for RP, and I did try it with some of the newer RPG-formatted cards I am working with, and it's doing better than I expected. It goes along with the ERP and does a pretty good job of applying the formatting properly. I would for sure be interested in ERP fine-tunes of it. I could run it with 32K context no problem at Q6, but it was running slow. With 8K context at Q6 it just flies.

I decided to try the base Qwen2-7B to see what the censorship is like:

From cannibalism:
[screenshot]
To censorship:
[screenshot]
It's not doing well 😭

Lmao, this photography question really messes them up badly :D SOLAR also gets it wrong 😭

The riddle shows just how impressive Yi-9B is; it can answer the question right 10/10 times.
Plus it can manage the weight questions (kg of feathers vs. lb of steel).
There's a big lack of Yi 1.5 RP models, yet it's so smart and has native 16K @_@

Ah, yeah, the Yi one seems interesting. It would be nice to see more rp tunes in that 9-30 range.

Seeing the performance of the Yi-34Bs makes a 24GB GPU so tempting; the reasoning seems better than Llama 3 70B from playing around with them in the LMSYS arena.
And Q4 @ 16K ctx would fit in VRAM; with cache quanting, Q5 might even be possible in VRAM.
