arxiv:2406.11430

A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression

Published on Jun 17
· Submitted by yuzhaouoe on Jun 18
Abstract

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the L_2 norm of a key embedding and the attention scores over cached KV pairs, where a low L_2 norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the L_2 norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and by 90% on passkey retrieval tasks without losing accuracy.

Community

Paper author and submitter:

This paper finds a clear correlation between the L2 norm of key embeddings and the attention scores over cached KV pairs, where a low L2 norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, the paper compresses the KV cache based on the L2 norm of key embeddings. The experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and by 90% on passkey retrieval tasks without losing accuracy.
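
Below is a minimal PyTorch sketch of what such an L2 norm-based eviction step could look like at inference time. It is not the authors' implementation: the function name `compress_kv_cache`, the `keep_ratio` and `keep_recent` parameters, and the cache layout `(batch, heads, seq_len, head_dim)` are all assumptions made for illustration; the idea is simply to retain the cached positions whose key embeddings have the smallest L2 norm.

```python
# A hypothetical sketch of L2 norm-based KV cache compression (not the paper's code).
# Assumed cache layout: keys/values of shape (batch, heads, seq_len, head_dim).
import torch

def compress_kv_cache(keys: torch.Tensor,
                      values: torch.Tensor,
                      keep_ratio: float = 0.5,
                      keep_recent: int = 32):
    """Keep the cached KV pairs whose key embeddings have the smallest L2 norm."""
    bsz, n_heads, seq_len, head_dim = keys.shape
    n_keep = min(max(int(seq_len * keep_ratio), keep_recent), seq_len)

    # L2 norm of each cached key embedding: shape (batch, heads, seq_len).
    key_norms = keys.norm(p=2, dim=-1)

    # Protect the most recent tokens from eviction by forcing their norms to -inf,
    # so they are always among the lowest-norm entries selected below.
    key_norms[..., -keep_recent:] = float("-inf")

    # Per head, pick the n_keep positions with the lowest key norms,
    # then restore temporal order so relative positions stay consistent.
    keep_idx = key_norms.topk(n_keep, dim=-1, largest=False).indices
    keep_idx = keep_idx.sort(dim=-1).values

    # Gather the retained entries: result shape (batch, heads, n_keep, head_dim).
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    return keys.gather(2, idx), values.gather(2, idx)
```

In a decoding loop, a function like this could be applied per layer whenever the cache exceeds a memory budget; the exact keep ratio and how many recent tokens to protect are hyperparameters, not values taken from the paper.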

Do you expect this technique to perform well in combination with other KV cache compression techniques, like quantization or layer sharing?

Paper author:

Yeah!

