lhl (leonardlin)

Collections (19)

- LLM in a flash: Efficient Large Language Model Inference with Limited Memory (Paper • 2312.11514)
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (Paper • 2312.12456)
- Accelerating LLM Inference with Staged Speculative Decoding (Paper • 2308.04623)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Paper • 2208.07339)
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees (Paper • 2307.13304)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Paper • 2306.03078)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models (Paper • 2308.13137)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Paper • 2306.00978)