arxiv:2502.14882

From 16-Bit to 1-Bit: Visual KV Cache Quantization for Memory-Efficient Multimodal Large Language Models

Published on Feb 15

Authors:

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success across various applications, yet their computational overhead during deployment remains a critical challenge. While Key-Value (KV) caching improves inference efficiency by trading memory for computation, the growing memory footprint from storing extensive KV caches reduces throughput and limits long-term execution on devices with constrained GPU memory. Existing approaches primarily focus on dropping unimportant tokens to reduce the KV cache size, mitigating memory constraints at the cost of potential information loss. In contrast, we propose a simple yet effective visual quantization strategy that preserves all visual tokens while significantly reducing memory consumption. To achieve an extreme quantization ratio, i.e., 1-bit quantization, we propose group-specific quantization and quantile-based quantization approaches, motivated by the inherent patterns of the KV cache. Our method is plug-and-play, enabling seamless integration into various MLLMs to improve memory efficiency without architectural modifications. Extensive experiments demonstrate that our approach effectively reduces memory overhead while maintaining computational efficiency and preserving multimodal performance.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

Your need to confirm your account before you can post a new comment.

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.14882 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.14882 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.14882 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.