Is the KV cache of these models unusually high?

#6 by Hugsanir

I noticed that for a 2048-token context window, llama.cpp allocates 2560 MiB for the KV cache, which seems extraordinarily high. This is without any KV quantization.

Here is a table I threw together with various models and their KV cache sizes at a context size of 2048. They are all GGUF-quantized to varying degrees, but that doesn't seem to make a difference. Try to spot the outlier 😁

| Model | Params | KV | Keys | Values |
|---|---|---|---|---|
| Mixtral-8x7B-Holodeck-v1 | 48B | 256 MiB | 128 MiB | 128 MiB |
| Meta-Llama-3-8B-Instruct | 8B | 256 MiB | 128 MiB | 128 MiB |
| Meta-Llama-3-70B-Instruct | 70B | 640 MiB | 320 MiB | 320 MiB |
| Qwen1.5-32B-Chat | 32B | 512 MiB | 256 MiB | 256 MiB |
| Yi-1.5-34B-Chat | 34B | 480 MiB | 240 MiB | 240 MiB |
| functionary-small-v2.4 | 7B | 256 MiB | 128 MiB | 128 MiB |
| c4ai-command-r-v01 | 35B | 2560 MiB | 1280 MiB | 1280 MiB |

@Hugsanir Yes, that is mostly expected, since the model doesn't use GQA (unlike Command-R+). Without grouped-query attention, every attention head stores its own keys and values, so the cache grows with the full head count instead of a smaller shared set of KV heads.
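For reference, here is a minimal back-of-the-envelope sketch of how the cache size falls out of the attention configuration, assuming an fp16 cache (llama.cpp's default) and layer/head counts taken from the models' published config.json files rather than from this thread:

```python
# Rough KV cache size: keys + values for every layer and KV head at a given context length.
# The per-model config values below are assumptions (from config.json), not from this thread.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """2 tensors (K and V) * layers * KV heads * head dim * context * bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

n_ctx = 2048

# Meta-Llama-3-8B-Instruct: GQA, 32 query heads sharing 8 KV heads.
llama3_8b = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=n_ctx)

# c4ai-command-r-v01: no GQA, so all 64 heads keep their own K/V.
command_r = kv_cache_bytes(n_layers=40, n_kv_heads=64, head_dim=128, n_ctx=n_ctx)

print(f"Llama-3-8B:    {llama3_8b / 2**20:.0f} MiB")   # ~256 MiB
print(f"Command-R v01: {command_r / 2**20:.0f} MiB")   # ~2560 MiB
```

Since the cache scales with the number of KV heads rather than the parameter count, a GQA model like Llama-3-70B ends up with a far smaller cache than the 35B Command-R in the table above.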
