The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines • arXiv:2408.01050 • Published Aug 2, 2024
Efficient Inference of Vision Instruction-Following Models with Elastic Cache • arXiv:2407.18121 • Published Jul 25, 2024
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference • arXiv:2407.14057 • Published Jul 19, 2024
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated • arXiv:2407.10969 • Published Jul 15, 2024
Inference Performance Optimization for Large Language Models on CPUs • arXiv:2407.07304 • Published Jul 10, 2024