Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads • Paper 2401.10774 • Published Jan 19
SliceGPT: Compress Large Language Models by Deleting Rows and Columns • Paper 2401.15024 • Published Jan 26
Speculative Streaming: Fast LLM Inference without Auxiliary Models • Paper 2402.11131 • Published Feb 16
SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization • Paper 2405.11582 • Published 13 days ago