CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs (arXiv:2409.12490, published Sep 19)
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (arXiv:2406.19707, published Jun 28)
Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding (arXiv:2409.08561, published Sep 13)
Diver: Large Language Model Decoding with Span-Level Mutual Information Verification (arXiv:2406.02120, published Jun 4)
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models (arXiv:2405.07542, published May 13)
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation (arXiv:2407.11798, published Jul 16)
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling (arXiv:2408.08696, published Aug 16)