Huiqiang Jiang PRO
iofu728
Posts
2
Post
Welcome to MInference, which exploits the dynamic sparsity of LLM attention (which nonetheless exhibits some static patterns) to speed up pre-filling for LLMs with million-token contexts. It first determines offline which sparse pattern each attention head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a 10x pre-filling speedup on an A100 at 1M tokens while maintaining accuracy.
For more details, please see:
project page: https://aka.ms/MInference
code: https://github.com/microsoft/MInference
paper: MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490)
hf demo: microsoft/MInference
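The online step described above (estimate a sparse index cheaply, then attend only to the selected positions) can be illustrated with a toy example. This is a minimal pure-Python sketch for a single head, not the MInference kernels: the function names `estimate_key_index` and `sparse_causal_attention`, the parameters `last_q` and `top_k`, and the choice to score keys using only the last few queries are all assumptions made for illustration. The offline per-head pattern search is not shown.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def estimate_key_index(Q, K, last_q=2, top_k=3):
    """Hypothetical online estimate: score every key with only the last
    `last_q` queries and keep the `top_k` highest-scoring key positions."""
    scale = 1.0 / math.sqrt(len(K[0]))
    col_score = [0.0] * len(K)
    for q in Q[-last_q:]:
        probs = softmax([dot(q, k) * scale for k in K])
        for j, p in enumerate(probs):
            col_score[j] += p
    ranked = sorted(range(len(K)), key=lambda j: col_score[j], reverse=True)
    return sorted(ranked[:top_k])


def sparse_causal_attention(Q, K, V, keep):
    """Attention where query i sees only kept keys j <= i, plus itself."""
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for i, q in enumerate(Q):
        cols = sorted({j for j in keep if j <= i} | {i})
        probs = softmax([dot(q, K[j]) * scale for j in cols])
        out.append([sum(p * V[j][c] for p, j in zip(probs, cols))
                    for c in range(len(V[0]))])
    return out


# Toy data: 8 tokens, head dimension 4.
random.seed(0)
n, d = 8, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

keep = estimate_key_index(Q, K, last_q=2, top_k=3)
out = sparse_causal_attention(Q, K, V, keep)
```

In this sketch each query touches at most `top_k + 1` keys instead of all preceding ones, which is where the savings would come from at long context lengths. A real implementation would run on the GPU with block-sparse kernels rather than Python loops.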
Post
Welcome to LLMLingua-2, a small yet powerful prompt compression method. Trained via data distillation from GPT-4 to perform token classification with a BERT-level encoder, it excels at task-agnostic compression. It surpasses LLMLingua in handling out-of-domain data while running 3x-6x faster.
@qianhuiwu
website: https://llmlingua.com/llmlingua2.html
code: https://github.com/microsoft/LLMLingua
demo: microsoft/llmlingua-2
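To make the token-classification idea concrete, here is a minimal sketch of compression as "score each token, keep the highest-scoring ones in their original order". It does not use the LLMLingua library or its trained encoder: `compress`, `toy_keep_prob`, and the stopword list are hypothetical stand-ins, and a real classifier would score each token using its surrounding context rather than a fixed word list.

```python
from typing import Callable, List

# Hypothetical stand-in for a trained classifier's stopword-like behaviour.
STOPWORDS = {"the", "of", "is", "for", "at", "a", "an", "to", "in"}


def toy_keep_prob(token: str) -> float:
    """Toy scorer: pretend function words are rarely worth keeping."""
    return 0.1 if token.lower() in STOPWORDS else 0.9


def compress(tokens: List[str],
             keep_prob: Callable[[str], float],
             rate: float) -> List[str]:
    """Keep roughly `rate` of the tokens, preferring high keep-probability
    tokens and preserving the original word order."""
    if not tokens:
        return []
    n_keep = max(1, int(len(tokens) * rate))
    scores = [keep_prob(t) for t in tokens]
    # sorted() is stable, so ties are broken by original position.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


prompt = "the meeting of the board is scheduled for monday at noon"
compressed = compress(prompt.split(), toy_keep_prob, rate=0.5)
print(" ".join(compressed))
```

Because the decision is a per-token keep/drop label rather than generated text, the compressed prompt contains only words from the original, in the original order.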