Is the KV cache of these models unusually high?

#6 by Hugsanir

I noticed that for a 2048-token context window, llama.cpp allocates 2560 MiB for the KV cache, which seems extraordinarily high. This is without any KV quantization.

Here is a table I threw together with various models and their KV cache sizes at a context size of 2048. They are all GGUF-quantized to varying degrees, but that doesn't seem to make a difference. Try to spot the outlier 😁

| Model | Params | KV | Keys | Values |
|---|---|---|---|---|
| Mixtral-8x7B-Holodeck-v1 | 48B | 256 MiB | 128 MiB | 128 MiB |
| Meta-Llama-3-8B-Instruct | 8B | 256 MiB | 128 MiB | 128 MiB |
| Meta-Llama-3-70B-Instruct | 70B | 640 MiB | 320 MiB | 320 MiB |
| Qwen1.5-32B-Chat | 32B | 512 MiB | 256 MiB | 256 MiB |
| Yi-1.5-34B-Chat | 34B | 480 MiB | 240 MiB | 240 MiB |
| functionary-small-v2.4 | 7B | 256 MiB | 128 MiB | 128 MiB |
| c4ai-command-r-v01 | 35B | 2560 MiB | 1280 MiB | 1280 MiB |

@Hugsanir Yes, that is mostly expected, since the model doesn't use GQA (unlike Command-R+). Without grouped-query attention, every attention head stores its own keys and values, so the cache grows with the full head count instead of a smaller shared set of KV heads.
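For reference, here is a minimal back-of-the-envelope sketch of how the cache size falls out of the attention configuration, assuming an fp16 cache (llama.cpp's default) and layer/head counts taken from the models' published config.json files rather than from this thread:

```python
# Rough KV cache size: keys + values for every layer and KV head at a given context length.
# The per-model config values below are assumptions (from config.json), not from this thread.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """2 tensors (K and V) * layers * KV heads * head dim * context * bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

n_ctx = 2048

# Meta-Llama-3-8B-Instruct: GQA, 32 query heads sharing 8 KV heads.
llama3_8b = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=n_ctx)

# c4ai-command-r-v01: no GQA, so all 64 heads keep their own K/V.
command_r = kv_cache_bytes(n_layers=40, n_kv_heads=64, head_dim=128, n_ctx=n_ctx)

print(f"Llama-3-8B:    {llama3_8b / 2**20:.0f} MiB")   # ~256 MiB
print(f"Command-R v01: {command_r / 2**20:.0f} MiB")   # ~2560 MiB
```

Since the cache scales with the number of KV heads rather than the parameter count, a GQA model like Llama-3-70B ends up with a far smaller cache than the 35B Command-R in the table above.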
