FlashDecoding++: Faster Large Language Model Inference on GPUs Paper • 2311.01282 • Published Nov 2, 2023 • 35
S-LoRA: Serving Thousands of Concurrent LoRA Adapters Paper • 2311.03285 • Published Nov 6, 2023 • 28
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization Paper • 2311.06243 • Published Nov 10, 2023 • 17
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores Paper • 2311.05908 • Published Nov 10, 2023 • 12
Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying Paper • 2311.09578 • Published Nov 16, 2023 • 14
I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization Paper • 2311.10126 • Published Nov 16, 2023 • 7
A Survey of Resource-efficient LLM and Multimodal Foundation Models Paper • 2401.08092 • Published Jan 16, 2024 • 3
SliceGPT: Compress Large Language Models by Deleting Rows and Columns Paper • 2401.15024 • Published Jan 26, 2024 • 69
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty Paper • 2401.15077 • Published Jan 26, 2024 • 19
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression Paper • 2403.15447 • Published Mar 18, 2024 • 16
A Controlled Study on Long Context Extension and Generalization in LLMs Paper • 2409.12181 • Published Sep 18, 2024 • 43