DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting Paper • 2503.00784 • Published 5 days ago • 9
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs Paper • 2503.01743 • Published 4 days ago • 60
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling Paper • 2502.14856 • Published 15 days ago • 7
Iterative Value Function Optimization for Guided Decoding Paper • 2503.02368 • Published 3 days ago • 14
PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization Paper • 2503.01328 • Published 4 days ago • 14
The Ultra-Scale Playbook 🌌 Space • 2.1k likes • The ultimate guide to training LLMs on large GPU clusters
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference Paper • 2502.18137 • Published 10 days ago • 51
Rank1: Test-Time Compute for Reranking in Information Retrieval Paper • 2502.18418 • Published 10 days ago • 25
MoBA: Mixture of Block Attention for Long-Context LLMs Paper • 2502.13189 • Published 17 days ago • 13
LightThinker: Thinking Step-by-Step Compression Paper • 2502.15589 • Published 14 days ago • 26
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Paper • 2502.14866 • Published 15 days ago • 12
Autellix: An Efficient Serving Engine for LLM Agents as General Programs Paper • 2502.13965 • Published 16 days ago • 18
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading Paper • 2502.12574 • Published 17 days ago • 11
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU Paper • 2502.08910 • Published 22 days ago • 142
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Paper • 2502.11089 • Published 19 days ago • 139
TransMLA: Multi-head Latent Attention Is All You Need Paper • 2502.07864 • Published 24 days ago • 46