arxiv:2407.07304

Inference Performance Optimization for Large Language Models on CPUs

Published on Jul 10
· Submitted by akhaliq on Jul 11
#2 Paper of the day

Abstract

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, deploying LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate the constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while ensuring precision. We propose a distributed inference optimization approach and implement it based on the oneAPI Collective Communications Library (oneCCL). Furthermore, we propose optimization approaches for LLMs on CPU and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.
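The abstract does not spell out how the KV cache is shrunk "while ensuring precision", so the sketch below is only a generic illustration of one common approach: per-token symmetric INT8 quantization of the key/value tensors with a per-row scale kept for dequantization. The function names and shapes are hypothetical and are not taken from xFasterTransformer.

```python
import numpy as np

def quantize_kv_int8(kv_fp16: np.ndarray):
    """Per-row symmetric INT8 quantization of a KV-cache tile.

    kv_fp16: [num_tokens, head_dim] key or value slice in FP16.
    Returns an INT8 payload plus one FP32 scale per token, roughly
    halving the cache footprint versus FP16.
    """
    scale = np.abs(kv_fp16).max(axis=-1, keepdims=True).astype(np.float32) / 127.0
    scale = np.maximum(scale, 1e-8)                      # guard against all-zero rows
    q = np.clip(np.round(kv_fp16 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an FP32 approximation of the cached keys/values."""
    return q.astype(np.float32) * scale

# Tiny smoke test: reconstruction error stays small relative to the value range.
kv = np.random.randn(16, 128).astype(np.float16)
q, s = quantize_kv_int8(kv)
err = np.abs(dequantize_kv_int8(q, s) - kv.astype(np.float32)).max()
print(f"max abs dequant error: {err:.4f}")
```

Relative to an FP16 cache this roughly halves the memory traffic per decoded token, which is where most of the capacity and latency headroom on CPUs would come from in such a scheme.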

Community

Paper submitter

Would these performance gains be useful on a single CPU with a batch size of 1, or would that be insignificant compared to a multi-CPU, high-batch-count setup? Cheers

Paper author

Yes, this solution can benefit both a single CPU and CPU server clusters. If your model is small, you can leverage just one socket; if your model is big, like 70B, and you have a good network connection, you can leverage the whole CPU cluster to run inference across multiple servers. By the way, CPU servers have a large memory capacity, so large batches are supported too.
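For readers unfamiliar with what "inference across multiple servers" looks like in practice: the paper implements its distributed path on oneCCL inside xFasterTransformer, but the communication pattern can be sketched with a generic allreduce over a tensor-parallel linear layer, one rank per socket or per server. The shapes, rank layout, and use of mpi4py below are illustrative assumptions, not the project's actual API.

```python
# Illustrative only: one MPI rank per CPU socket/server. The real
# implementation in xFasterTransformer performs the allreduce via oneCCL.
# Run with e.g.:  mpirun -np 2 python tp_linear_sketch.py   (hypothetical script name)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

hidden, out_features, batch = 1024, 1024, 8
assert hidden % world == 0
shard = hidden // world                      # each rank owns a slice of the input dimension

rng = np.random.default_rng(rank)            # distinct shard contents per rank
x_local = rng.standard_normal((batch, shard), dtype=np.float32)         # this rank's activation slice
w_local = rng.standard_normal((shard, out_features), dtype=np.float32)  # this rank's weight shard

partial = x_local @ w_local                  # partial result from the local shard
y = np.empty_like(partial)
comm.Allreduce(partial, y, op=MPI.SUM)       # sum partials across sockets/servers

if rank == 0:
    print("full output shape:", y.shape)
```

Each rank computes a partial matmul from its own weight shard and the allreduce sums the partials into the full layer output, which is why a fast interconnect matters for the 70B multi-server case.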

What types of CPUs and configurations will you be focusing on in your future research?

Paper author

We will work on Granite Rapids (Intel's next-generation Xeon server microarchitecture on the Intel 3 process, the successor to Emerald Rapids) together with MCR DIMMs.


