arxiv:2407.02490

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Published on Jul 2 · Submitted by iofu728 on Jul 3


Authors:

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Zhenhua Han

Abstract

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
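To make the three pattern names concrete, here is a minimal pure-Python sketch of boolean causal masks that match a common reading of them: A-shape as a few initial "sink" columns plus a local window, Vertical-Slash as selected key columns plus selected diagonals, and Block-Sparse as selected tiles. The function names and parameters (`n_sink`, `n_local`, `vertical_cols`, `slash_offsets`, `kept_blocks`) are illustrative assumptions, not the MInference API, and the real method uses GPU kernels rather than dense masks.

```python
def a_shape_mask(n, n_sink, n_local):
    """Causal mask keeping the first n_sink key columns plus a local window."""
    return [[k <= q and (k < n_sink or q - k < n_local) for k in range(n)]
            for q in range(n)]


def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Causal mask keeping chosen key columns and chosen diagonals.

    A diagonal is identified by its offset q - k (0 is the main diagonal).
    """
    cols, offs = set(vertical_cols), set(slash_offsets)
    return [[k <= q and (k in cols or (q - k) in offs) for k in range(n)]
            for q in range(n)]


def block_sparse_mask(n, block, kept_blocks):
    """Causal mask keeping selected (query_block, key_block) tiles."""
    kept = set(kept_blocks)
    return [[k <= q and (q // block, k // block) in kept for k in range(n)]
            for q in range(n)]


def density(mask):
    """Fraction of the causal lower triangle that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) / 2)
```

Calling `density(a_shape_mask(n, n_sink, n_local))` for growing `n` with fixed `n_sink` and `n_local` shows why such patterns help: the kept entries grow roughly linearly in `n`, while the full causal triangle grows quadratically.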


Community

iofu728

Paper author Paper submitter 2 days ago

MInference 1.0 leverages the dynamic sparse nature of LLMs' attention, which nonetheless exhibits some static patterns, to speed up pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with optimized custom kernels. This approach achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy at 1M tokens.
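One plausible way to "approximate the sparse index online" for a Vertical-Slash head is to run exact attention only for the last few queries and use their attention mass to vote for key columns and diagonals. The sketch below assumes that approach; the function name `estimate_vertical_slash` and its parameters (`last_q`, `top_v`, `top_s`) are hypothetical, and it is a single-head, pure-Python illustration rather than the project's kernel code.

```python
import math


def estimate_vertical_slash(q, k, last_q, top_v, top_s):
    """Estimate important key columns and diagonals from the last queries.

    q, k: lists of equal-length float vectors, one per token (single head).
    Returns (vertical_cols, slash_offsets), each sorted ascending.
    """
    n, d = len(k), len(k[0])
    col_mass = [0.0] * n    # attention mass received by each key column
    diag_mass = [0.0] * n   # attention mass on each diagonal offset q_idx - k_idx

    for qi in range(max(0, n - last_q), n):
        # Causal scaled dot-product scores against keys 0..qi.
        scores = [sum(a * b for a, b in zip(q[qi], k[ki])) / math.sqrt(d)
                  for ki in range(qi + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        z = sum(exps)
        for ki, e in enumerate(exps):
            p = e / z
            col_mass[ki] += p
            diag_mass[qi - ki] += p

    def top(mass, t):
        # Indices of the t largest entries, returned in ascending order.
        return sorted(sorted(range(n), key=lambda i: -mass[i])[:t])

    return top(col_mass, top_v), top(diag_mass, top_s)
```

The cost of this estimate is `last_q * n` score computations instead of `n * n`, which is what makes building the index at inference time affordable in this sketch.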

For more details, please see our project page (aka.ms/MInference) and the code.

iofu728

Paper author Paper submitter 2 days ago · edited 2 days ago

Due to an issue with arXiv, the PDF is currently unavailable there. You can find the paper at this link.

Bachstelze

1 day ago

What is the method used to obtain the graphical attention patterns?

iofu728

Paper author 1 day ago

Hi @Bachstelze, first, we identified three sparse patterns in attention heads through observation. We then determined the optimal sparse pattern for each head using offline search, as described in Section 3.2.1. Finally, we used online approximate dynamic sparse indexing and sparse calculations to accelerate LLM inference.
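The offline search step can be pictured with a simplified stand-in: given one head's dense attention probabilities on a calibration prompt and a set of candidate masks of comparable cost, keep the candidate that recovers the most attention mass. This is only an illustrative assumption about the selection criterion, not the procedure in Section 3.2.1 itself; `attention_recall` and `pick_pattern` are hypothetical names.

```python
def attention_recall(probs, mask):
    """Fraction of total attention mass that falls on positions the mask keeps."""
    kept = sum(p
               for prob_row, mask_row in zip(probs, mask)
               for p, keep in zip(prob_row, mask_row)
               if keep)
    total = sum(sum(row) for row in probs)
    return kept / total


def pick_pattern(probs, candidates):
    """Choose the candidate mask with the highest attention recall.

    probs: dense attention probabilities for one head (list of rows).
    candidates: dict mapping a pattern name to a boolean mask of the same shape.
    Returns (best_name, recall_by_name).
    """
    recalls = {name: attention_recall(probs, mask)
               for name, mask in candidates.items()}
    best = max(recalls, key=recalls.get)
    return best, recalls
```

Because the assignment is made once per head and stored, only the cheaper index-building step has to run for each new prompt.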