This repo contains only the AttnGate weights for the Qwen2.5-14B-Instruct model.
SeerAttention introduces learnable AttnGate modules that accelerate the computationally intensive prefill stage of long-context large language models (LLMs) via dynamic block-level sparsity. The AttnGates are trained in a parameter-efficient self-distillation framework, where they learn to mimic the 2D max-pooled attention patterns of the original frozen model, preserving its integrity while avoiding costly retraining. During inference, the gates generate block-sparse binary masks by applying a threshold or TopK selection to their learned soft scores, enabling efficient computation through a custom block-sparse FlashAttention kernel.
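To make the gating step concrete, below is a minimal, self-contained PyTorch sketch. The block size, pooling choices, and gate architecture are illustrative assumptions rather than the exact SeerAttention implementation; the point is only to show how soft block-level scores are binarized into a block-sparse mask (e.g. with a threshold such as the 5e-4 used in the benchmarks further down).

```python
# Minimal sketch of block-level gating, NOT the exact SeerAttention code:
# block size, pooling, and gate architecture are illustrative assumptions.
import torch
import torch.nn as nn

BLOCK = 64  # attention block size along the sequence dimension (assumption)

class AttnGate(nn.Module):
    """Predicts a soft importance score for every (query-block, key-block) pair."""
    def __init__(self, head_dim: int, gate_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(head_dim, gate_dim, bias=False)
        self.k_proj = nn.Linear(head_dim, gate_dim, bias=False)

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: [batch, heads, seq_len, head_dim]; seq_len assumed divisible by BLOCK
        b, h, s, d = q.shape
        q_blk = q.view(b, h, s // BLOCK, BLOCK, d).mean(dim=3)        # pooled query blocks
        k_blk = k.view(b, h, s // BLOCK, BLOCK, d).max(dim=3).values  # pooled key blocks
        # Soft block-to-block scores: [batch, heads, num_q_blocks, num_k_blocks]
        scores = torch.einsum("bhqd,bhkd->bhqk", self.q_proj(q_blk), self.k_proj(k_blk))
        return scores.softmax(dim=-1)

def block_mask_from_scores(scores: torch.Tensor, threshold: float = 5e-4) -> torch.Tensor:
    """Binarize soft gate scores into a block-sparse boolean mask (threshold variant)."""
    return scores >= threshold

# Toy usage: the boolean mask decides which key/value blocks each query block
# attends to inside a block-sparse FlashAttention kernel.
q = torch.randn(1, 2, 4 * BLOCK, 128)
k = torch.randn(1, 2, 4 * BLOCK, 128)
mask = block_mask_from_scores(AttnGate(head_dim=128)(q, k), threshold=5e-4)
print(mask.shape, mask.float().mean().item())  # mask density = fraction of blocks kept
```

During self-distillation the soft scores are trained to match the 2D max-pooled attention map of the frozen model; at inference only the binarized block mask is needed.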
## Original GitHub Repo
https://github.com/microsoft/SeerAttention
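The AttnGate weights in this repo are meant to be attached to the frozen base model by the code in that repository. As a hedged sketch (the exact repo id and loading entry point are assumptions here; follow the GitHub README for the real workflow), the checkpoint files can be fetched with the standard Hugging Face Hub API:

```python
# Sketch only: download the AttnGate checkpoint so the SeerAttention codebase
# (linked above) can attach it to the frozen Qwen2.5-14B-Instruct base model.
# The repo_id below is a placeholder; use this model card's actual repo id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="SeerAttention/SeerAttention-Qwen2.5-14B-AttnGates")
print("AttnGate weights downloaded to:", local_dir)
```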
## Evaluation Results
### PG19 PPL
Density | 8192 tokens (ppl) | 16384 tokens (ppl) | 32768 tokens (ppl) |
---|---|---|---|
0.10 | 8.62 | 8.23 | 8.17 |
0.20 | 8.32 | 8.08 | 8.06 |
0.30 | 8.23 | 8.02 | 8.03 |
0.40 | 8.19 | 8.00 | 8.01 |
0.50 | 8.17 | 7.99 | 8.00 |
1.00 | 8.16 | 7.99 | 8.00 |
### LongBench
Dataset | 0-4k (Dense / Sparse) | 4-8k (Dense / Sparse) | 8k+ (Dense / Sparse) |
---|---|---|---|
qasper | 47.23/48.05 | 37.51/37.20 | 35.26/36.49 |
multifieldqa_en | 56.40/56.10 | 47.13/47.36 | 48.64/50.36 |
lcc | 62.32/63.25 | 67.48/66.58 | 61.47/63.53 |
gov_report | 34.26/34.30 | 34.06/33.70 | 33.02/32.52 |
2wikimqa | 51.29/52.13 | 48.03/47.78 | 31.68/30.90 |
multi_news | 26.46/26.21 | 23.71/23.55 | 22.42/22.58 |
samsum | 42.97/42.95 | 41.08/40.23 | 44.88/44.62 |
passage_count | 20.00/19.00 | 07.00/06.00 | 08.00/08.00 |
repobench-p | 64.17/63.63 | 64.87/64.61 | 57.85/58.60 |
trec | 60.00/60.00 | 75.00/74.00 | 71.00/71.00 |
hotpotqa | 58.57/57.16 | 56.87/55.91 | 56.18/56.99 |
triviaqa | 87.63/87.35 | 88.38/90.00 | 88.49/90.15 |
passage_retrieval_en | 99.00/99.00 | 100.0/100.0 | 100.0/100.0 |
Average score | 54.64/54.55 | 53.16/52.84 | 50.68/51.21 |
Average density (sparse) | 0.841 | 0.624 | 0.379 |
### LongBenchV2 CoT Benchmark
All SeerAttention models run with threshold=5e-4.
For the R1-distilled models, we remove the two-pass generation setup (think + summary) and directly ask the models to output the answer after thinking. The maximum generation length is set to 10240.
Model | Overall | Easy | Hard | Short | Medium | Long |
---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct | 30.4 | 31.2 | 29.9 | 37.8 | 24.7 | 29.6 |
SeerAttention-Llama-3.1-8B | 31.6 | 33.3 | 30.5 | 33.9 | 31.6 | 27.8 |
Qwen2.5-14B-Instruct | 34.8 | 37.5 | 33.1 | 44.4 | 32.1 | 24.1 |
SeerAttention-Qwen2.5-14B | 32.8 | 38.0 | 29.6 | 45.0 | 30.2 | 17.6 |
Qwen2.5-32B-Instruct | 36.4 | 42.2 | 32.8 | 47.8 | 29.8 | 30.6 |
SeerAttention-Qwen2.5-32B | 36.4 | 41.1 | 33.4 | 49.4 | 29.8 | 27.8 |
DeepSeek-R1-Distill-Qwen-14B | 34.2 | 43.2 | 28.6 | 45.0 | 27.9 | 28.7 |
SeerAttention-DeepSeek-R1-Distill-Qwen-14B | 31.6 | 35.9 | 28.9 | 41.7 | 26.0 | 25.9 |
DeepSeek-R1-Distill-Qwen-32B | 37.2 | 42.7 | 33.8 | 47.2 | 35.8 | 23.1 |
SeerAttention-DeepSeek-R1-Distill-Qwen-32B | 37.0 | 42.2 | 33.8 | 49.4 | 31.6 | 26.9 |