arxiv:2312.12456

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Published on Dec 16, 2023
· Featured in Daily Papers on Dec 21, 2023

Abstract

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine for a personal computer (PC) equipped with a single consumer-grade GPU. The key insight underlying the design of PowerInfer is the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits this insight in a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This outperforms llama.cpp by up to 11.69x while retaining model accuracy.
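To make the hot/cold split concrete, here is a minimal sketch of the placement idea in Python. It illustrates the concept only, not PowerInfer's actual code: the activation counts are simulated with a Zipf draw standing in for real offline profiling, and `gpu_budget` is a made-up stand-in for available VRAM.

```python
import numpy as np

# Hypothetical sketch of the hot/cold neuron split (not PowerInfer's code).
# Given per-neuron activation counts from an offline profiling run, keep the
# most frequently activated neurons resident on the GPU and leave the rest
# to the CPU. Under a power-law distribution, a small GPU budget covers the
# bulk of the activations actually seen at inference time.

rng = np.random.default_rng(0)
num_neurons = 11_008                       # e.g. one LLaMA-7B FFN layer
activation_counts = rng.zipf(a=1.5, size=num_neurons).astype(np.float64)

gpu_budget = 2_000                         # assumed VRAM capacity, in neurons
order = np.argsort(activation_counts)[::-1]
hot_ids = order[:gpu_budget]               # "hot": preload onto the GPU
cold_ids = order[gpu_budget:]              # "cold": compute on the CPU

covered = activation_counts[hot_ids].sum() / activation_counts.sum()
print(f"{gpu_budget / num_neurons:.1%} of neurons on the GPU "
      f"cover {covered:.1%} of observed activations")
```

With Zipf-distributed counts, the printed coverage far exceeds the fraction of neurons placed on the GPU, which is the intuition behind the reduced GPU memory demand and CPU-GPU traffic.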

Community

With LM Studio, on a mid-range laptop with 8 GB of VRAM, I get approx. 30 tokens/sec with a 4-bit 7B model.

PowerInfer should let me run the full-precision model at the same speed, or a 4-bit quant at approx. 240 tokens/sec (if the fuzzy calculations in my head are right).

This is absolutely nuts. Most exciting paper I've read for a while
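As a rough check on the comment's arithmetic, here is an editor's back-of-envelope, not the commenter's or the paper's numbers. Every figure is an assumption: the 11.69x speedup is the paper's best case against llama.cpp, and the 4x factor assumes memory-bandwidth-bound decoding with fp16 weights weighing about 4x as much as 4-bit ones.

```python
# Back-of-envelope only; assumptions, not measurements.
baseline_tps = 30.0        # commenter's observed 4-bit 7B rate in LM Studio
best_case_speedup = 11.69  # paper's peak reported speedup over llama.cpp

est_4bit = baseline_tps * best_case_speedup  # ~351 tokens/sec, best case
# fp16 weights are ~4x the bytes of 4-bit ones; if decoding is
# memory-bandwidth-bound, throughput drops by roughly that factor.
est_fp16 = est_4bit / 4                      # ~88 tokens/sec, best case

print(f"optimistic 4-bit estimate: {est_4bit:.0f} tokens/sec")
print(f"optimistic fp16 estimate:  {est_fp16:.0f} tokens/sec")
```

These are upper bounds stacked on assumptions; the paper's speedups were measured on unquantized models, so applying them to a 4-bit baseline is optimistic.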

If this, or a derivative of this, works with MoE models, this + Mixtral is basically a local ChatGPT: super fast, super private, super uncensored (on a mid-range laptop) 👍👍👍👍👍👍👍

Can't wait for the repo to come out. Great paper. 🙏


Models citing this paper 2


Collections including this paper 14