arxiv:2306.14048

H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Published on Jun 24, 2023 · Featured in Daily Papers on Jun 27, 2023

Abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H_2). Through a comprehensive investigation, we find that (i) the emergence of H_2 is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H_2O), a KV cache eviction policy that dynamically retains a balance of recent and H_2 tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H_2O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B. With the same batch size, H_2O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
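The abstract describes H_2O as retaining a balance of recent tokens and heavy hitters under a fixed KV cache budget. Below is a minimal sketch of that selection step, assuming an accumulated attention score is already tracked per cached token; the function name `h2o_keep_mask` and the `recent_ratio` split are illustrative choices, not the authors' implementation.

```python
import torch

def h2o_keep_mask(acc_attention: torch.Tensor, budget: int,
                  recent_ratio: float = 0.5) -> torch.Tensor:
    """Boolean mask over cached positions: recent tokens + heavy hitters.

    acc_attention: accumulated attention score per cached token, shape [seq_len].
    budget: total number of KV entries to retain.
    recent_ratio: fraction of the budget reserved for the most recent tokens.
    """
    seq_len = acc_attention.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)
    if seq_len <= budget:
        return keep.fill_(True)

    n_recent = int(budget * recent_ratio)
    n_heavy = budget - n_recent

    # Always keep the local window of most recent tokens.
    keep[seq_len - n_recent:] = True

    # Among older tokens, keep the heavy hitters: highest accumulated attention.
    if n_heavy > 0:
        older = acc_attention[: seq_len - n_recent]
        keep[torch.topk(older, n_heavy).indices] = True
    return keep
```

In practice such a mask would be applied per layer and per attention head, e.g. `k_cache, v_cache = k_cache[keep], v_cache[keep]`, so retained KV memory scales with the budget rather than with the full sequence length.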

Community

Proposes Heavy Hitter Oracle (H2O): a method to reduce the GPU memory (VRAM) footprint of the KV cache in transformer-based LLMs during long-sequence inference, via a KV cache eviction policy that retains H2 tokens. Heavy Hitter (H2) tokens are the small set of tokens that contribute most to attention scores and output quality. Attention matrices of LLMs are very sparse, accumulated attention scores follow a power-law distribution, and the H2O eviction policy is near-greedy. H2O achieves higher throughput than DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, with lower latency. Common deployment strategies include quantisation (reducing the size of model weights), pruning/sparsity (this work applies it to the KV cache), and (knowledge) distillation. The eviction policy evicts at most one key per decoding step, maintaining the KV cache at a fixed budget. Cache information is the attention normalized over the keys of retained tokens, accumulated through the generative process of cache eviction; the goal is to find a KV eviction policy whose output matches that of an unrestricted cache. Observes attention sparsity across layers of OPT (a pre-trained LLM) evaluated on WikiText-103, with deeper layers showing higher sparsity, and confirms the presence of H2 tokens via their correlation with token co-occurrence and the performance decline when they are removed. The H2 eviction algorithm uses a scoring function based on past attention values (the cache information sum). Empirical evaluation shows reduced memory footprint, increased inference throughput, and retained performance across different sequence lengths and quantisation settings. The appendix has implementation details, extended related work, additional experiments, and a detailed theoretical analysis with definitions, lemmas, and results (including gradients and Hessians, the latter shown to be positive definite and Lipschitz). From UT Austin, Stanford University, UCSD, UC Berkeley, Adobe, Meta, and CMU.
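The summary above notes that the policy evicts at most one key per decoding step while keeping the cache at a fixed budget, scoring tokens by the sum of attention they have received so far. The sketch below illustrates that per-step update under those assumptions; `H2OCacheSketch` and `step` are hypothetical names, and a real implementation would track scores per layer and per attention head and drop the matching K/V entries alongside the score.

```python
from typing import Optional

import torch


class H2OCacheSketch:
    """Toy single-head tracker for H2O-style eviction (illustrative only)."""

    def __init__(self, budget: int, recent: int):
        assert recent < budget, "the recent window must fit inside the budget"
        self.budget = budget          # max number of KV entries to retain
        self.recent = recent          # most recent tokens that are never evicted
        self.scores = torch.zeros(0)  # accumulated attention per retained token

    def step(self, attn_row: torch.Tensor) -> Optional[int]:
        """Update scores with the new query's attention row and evict if needed.

        attn_row: attention weights of the new token over all retained tokens
        plus itself (length = current cache size + 1).
        Returns the cache index to evict (caller drops that K/V entry), or None.
        """
        # Accumulate the attention mass each retained token just received.
        self.scores = torch.cat([self.scores, torch.zeros(1)]) + attn_row

        if self.scores.numel() <= self.budget:
            return None  # still within budget, nothing to evict

        # Evict exactly one entry: the lowest-scoring token outside the recent window.
        cutoff = self.scores.numel() - self.recent
        evict = int(torch.argmin(self.scores[:cutoff]))
        self.scores = torch.cat([self.scores[:evict], self.scores[evict + 1:]])
        return evict
```

Calling `step` once per generated token keeps the cache size constant at `budget`, which is what turns the KV cache footprint from linear in sequence length into linear in the chosen budget.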

Links: GitHub
