XAttention: Block Sparse Attention with Antidiagonal Scoring Paper • 2503.16428 • Published Mar 2025 • 12
🧠Reasoning datasets Collection Datasets with reasoning traces for math and code released by the community • 20 items • Updated 2 days ago • 118
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Paper • 2502.14866 • Published Feb 20 • 13
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads Paper • 2410.10819 • Published Oct 14, 2024 • 7
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference Paper • 2406.10774 • Published Jun 16, 2024 • 3
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Paper • 2405.04532 • Published May 7, 2024
Retrieval Head Mechanistically Explains Long-Context Factuality Paper • 2404.15574 • Published Apr 24, 2024 • 3
InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory Paper • 2402.04617 • Published Feb 7, 2024 • 4
BitDelta: Your Fine-Tune May Only Be Worth One Bit Paper • 2402.10193 • Published Feb 15, 2024 • 22