S³: Increasing GPU Utilization during Generative Inference for Higher Throughput Paper • 2306.06000 • Published Jun 9, 2023
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference Paper • 2405.12532 • Published May 21, 2024
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget Paper • 2404.04793 • Published Apr 7, 2024
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models Paper • 2405.14366 • Published May 23, 2024
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM Paper • 2403.05527 • Published Mar 8, 2024
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation Paper • 2405.05329 • Published May 8, 2024