Question About KV Cache Efficiency

#2 · opened by ineedquants

Since I didn't know my way around Hugging Face yet, I accidentally posted my comment under a different topic, so this is a follow-up to it. As I mentioned there, the KV cache is much larger than with the Qwen 3.5 models. After the suggestion to set the KV cache to Q8, I became curious: is the MiniMax architecture inherently less efficient at KV caching than Qwen 3.5, or is there an optimization that hasn't been implemented yet? (I don't mean to imply that anything was done incorrectly.)

You're entirely right. The difference you're noticing is architectural, not a missing implementation (a back-of-the-envelope size comparison follows the list):

  • MiniMax-M2.7 uses traditional GQA at a 6:1 ratio (48 query heads, 8 KV heads), giving n_embd_k_gqa = 1024
  • Qwen 3 MoE models use GQA at a 16:1 ratio, giving n_embd_k_gqa = 512, roughly half the KV cache per layer
  • DeepSeek V3 goes further with MLA (Multi-head Latent Attention), cutting the KV cache by roughly 5× versus standard GQA

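To make the per-layer numbers concrete, here is a quick sizing sketch. The head geometry comes from the list above; the layer counts (62 for MiniMax, 94 for the Qwen 3 MoE flagship) are my own assumptions, so treat the totals as illustrative rather than exact:

```python
# Back-of-the-envelope KV-cache sizing. Head geometry is from the list
# above; layer counts are assumptions for illustration only.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elt: float = 2.0) -> float:
    """Bytes of K + V stored per token across all layers (f16 by default)."""
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_elt

# MiniMax: 8 KV heads x 128 head dim -> n_embd_k_gqa = 1024 per layer
minimax = kv_bytes_per_token(n_layers=62, n_kv_heads=8, head_dim=128)
# Qwen 3 MoE: 4 KV heads x 128 head dim -> n_embd_k_gqa = 512 per layer
qwen3 = kv_bytes_per_token(n_layers=94, n_kv_heads=4, head_dim=128)

print(f"MiniMax  : {minimax / 1024:.0f} KiB/token")   # ~248 KiB/token
print(f"Qwen3 MoE: {qwen3 / 1024:.0f} KiB/token")     # ~188 KiB/token

# At the 196K native context, the f16 KV cache alone is substantial:
print(f"MiniMax @ 196K ctx: {minimax * 196_608 / 2**30:.1f} GiB")  # ~46.5 GiB
```

The per-layer halving (n_embd_k_gqa 1024 vs 512) is the architectural point; the totals differ less here only because the assumed layer counts differ.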
MiniMax chose to prioritize MoE expert expansion (256 experts, top-8 routing) and wide native context (196K positions), and stuck with standard GQA at a moderate compression ratio. That's a training-time decision. llama.cpp (or any inference framework) just implements what the architecture specifies.

The only real knob you have on the inference side is KV cache quantization: `--cache-type-k q8_0 --cache-type-v q8_0` halves the KV footprint with essentially no quality loss. For this model that's especially worthwhile at longer context.
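To put a number on "halves": llama.cpp's q8_0 stores 32 int8 values plus one f16 scale per block, about 1.06 bytes per element versus 2 bytes for f16. A rough sketch, reusing the same assumed geometry as above:

```python
# Rough KV-cache footprint with f16 vs q8_0 cache types.
# Assumes q8_0 = 34 bytes per 32-element block (~1.0625 bytes/element)
# and the same illustrative model geometry as the sketch above.

F16_BPE = 2.0
Q8_0_BPE = 34 / 32  # 32 int8 values + one f16 scale per block

def kv_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
           bytes_per_elt: float) -> float:
    """Total K + V cache size in GiB at a given context length."""
    return ctx * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elt / 2**30

for name, bpe in (("f16 ", F16_BPE), ("q8_0", Q8_0_BPE)):
    print(name, f"{kv_gib(196_608, 62, 8, 128, bpe):.1f} GiB")
# f16  46.5 GiB
# q8_0 24.7 GiB
```

On the command line that looks something like `llama-server -m model.gguf -c 196608 -fa --cache-type-k q8_0 --cache-type-v q8_0` (model path and context size are placeholders); as far as I know, llama.cpp needs flash attention (`-fa`) enabled to use a quantized V cache, so keep that flag.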

Ah, thank you very much; I'm still learning. I just noticed that the RAM allocation works differently than with other models: this architecture seems to allocate a fixed amount of memory, while Qwen 3.5 appears to use a somewhat hybrid approach.
Just for my understanding: how does IQ4_NL compare to Q4_K_M in real-world scenarios? I'm considering replacing Qwen 3.5 122B with a reasonably capable MiniMax setup, mainly for coding tasks.
