Weirdness with offloading to VRAM

#3 opened by Bjorno

Why do all the layers of this model not fit in VRAM at 4096 context, when other 7B Q6_K models fit at 8192 context? This model with 25 offloaded layers takes up more VRAM than any other 7B Q6_K model with all 33 layers offloaded.

Hmm, I tried theBloke/deepseek-coder-6.7B-instruct-GGUF Q6_K and it has the same problem, but I don't understand why this happens. Yes, I don't have a lot of VRAM (8 GB), but that has always been enough to fully offload any other 7B Q6_K model with 8k context, and certainly with 4k.

This is the first time I've had to reduce the context all the way down to 2k to fully offload a model into VRAM.
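For reference, a rough way to see why halving the context frees up so much room is to estimate the KV-cache size. This is only a back-of-the-envelope sketch assuming typical 6.7B Llama-style dimensions (32 layers, 32 KV heads, head dim 128) and an fp16 cache, not exact numbers for this model:

```python
# Rough KV-cache size estimate for a Llama-style 6.7B model (assumed dims).
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # K and V tensors per layer, each n_ctx * n_kv_heads * head_dim elements.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for ctx in (2048, 4096, 8192):
    print(f"{ctx} ctx -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB KV cache")
# 2048 ctx -> 1.0 GiB, 4096 ctx -> 2.0 GiB, 8192 ctx -> 4.0 GiB (under these assumptions)
```

So on an 8 GB card, going from 2k to 4k context alone costs roughly an extra gigabyte on top of the ~5.5 GB of Q6_K weights, which matches the "barely fits" behaviour.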

Maybe I just don't understand how DeepSeek Coder models are designed.

UPD: In koboldcpp, unlike LM Studio, the model does load with 4k context fully offloaded into VRAM, but only just barely.

I unfortunately don't have a ton of experience with low-VRAM offloads. You've done what I would have done, which is to reduce the context until it fits.

I just loaded the Q6_K version on my 12 GB 3060 and it's using 11.1 GB out of the 12.2 GB available.

If I remember correctly, this is due to the hugely increased vocabulary size (100k+ versus Llama's usual 32k).
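To put a rough number on that, here is a sketch only, assuming a ~100k vocabulary as mentioned above, a 4096 hidden size, and Q6_K's ~6.56 bits per weight (the actual tensor types vary per GGUF):

```python
# Rough estimate of how much a larger vocabulary adds to the weights (assumed sizes).
HIDDEN = 4096       # assumed hidden size of a 6.7B Llama-style model
Q6K_BITS = 6.5625   # approximate bits per weight for Q6_K

def embed_plus_output_mib(vocab_size, bits=Q6K_BITS):
    # Token-embedding matrix plus output-projection matrix, each vocab x hidden.
    return 2 * vocab_size * HIDDEN * bits / 8 / 2**20

for vocab in (32_000, 102_400):
    print(f"vocab {vocab}: ~{embed_plus_output_mib(vocab):.0f} MiB for embeddings + output head")
# vocab 32000:  ~205 MiB
# vocab 102400: ~656 MiB
```

The logits buffer also scales with vocabulary size, so under these assumptions the bigger vocabulary could plausibly account for a few hundred extra MB compared to a standard 32k-vocab 7B, even before the KV cache is counted.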
