Weirdness with offloading to VRAM

#3 opened by Bjorno

Why do all the layers of this model not fit in VRAM at 4096 context, when other 7B Q6_K models fit at 8192 context? This model with 25 offloaded layers takes up more VRAM than any other 7B Q6_K model with all 33 layers offloaded.

Hmm, I tried theBloke/deepseek-coder-6.7B-instruct-GGUF Q6_K and it has the same problem, but I don't understand why this happens. Yes, I don't have a lot of VRAM (8 GB), but that has always been enough to fully offload any other 7B Q6_K model with 8k context, and certainly with 4k.

This is the first time I've had to reduce the context all the way down to 2k to fully offload a model into VRAM.
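For reference, a rough way to see why halving the context frees up so much room is to estimate the KV-cache size. This is only a back-of-the-envelope sketch assuming typical 6.7B Llama-style dimensions (32 layers, 32 KV heads, head dim 128) and an fp16 cache, not exact numbers for this model:

```python
# Rough KV-cache size estimate for a Llama-style 6.7B model (assumed dims).
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # K and V tensors per layer, each n_ctx * n_kv_heads * head_dim elements.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for ctx in (2048, 4096, 8192):
    print(f"{ctx} ctx -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB KV cache")
# 2048 ctx -> 1.0 GiB, 4096 ctx -> 2.0 GiB, 8192 ctx -> 4.0 GiB (under these assumptions)
```

So on an 8 GB card, going from 2k to 4k context alone costs roughly an extra gigabyte on top of the ~5.5 GB of Q6_K weights, which matches the "barely fits" behaviour.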

Maybe I just don't understand how DeepSeek Coder models are designed.

UPD: In koboldcpp, unlike LM Studio, the model does load with 4k context fully offloaded into VRAM, but only just barely.

I unfortunately don't have a ton of experience with low-VRAM offloads. You've done what I would have done, which is to reduce the context until it fits.

I just loaded the Q6_K version on my 12 GB 3060 and it's using 11.1 GB out of the 12.2 GB available.

If I remember correctly, this is due to the hugely increased vocabulary size (100k+ versus Llama's usual 32k).
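To put a rough number on that, here is a sketch only, assuming a ~100k vocabulary as mentioned above, a 4096 hidden size, and Q6_K's ~6.56 bits per weight (the actual tensor types vary per GGUF):

```python
# Rough estimate of how much a larger vocabulary adds to the weights (assumed sizes).
HIDDEN = 4096       # assumed hidden size of a 6.7B Llama-style model
Q6K_BITS = 6.5625   # approximate bits per weight for Q6_K

def embed_plus_output_mib(vocab_size, bits=Q6K_BITS):
    # Token-embedding matrix plus output-projection matrix, each vocab x hidden.
    return 2 * vocab_size * HIDDEN * bits / 8 / 2**20

for vocab in (32_000, 102_400):
    print(f"vocab {vocab}: ~{embed_plus_output_mib(vocab):.0f} MiB for embeddings + output head")
# vocab 32000:  ~205 MiB
# vocab 102400: ~656 MiB
```

The logits buffer also scales with vocabulary size, so under these assumptions the bigger vocabulary could plausibly account for a few hundred extra MB compared to a standard 32k-vocab 7B, even before the KV cache is counted.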
