arxiv:2310.19102

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Published on Oct 29, 2023
· Featured in Daily Papers on Oct 31, 2023
Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73× compared to FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.
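The "mixed-precision and fine-grained quantization" the abstract mentions boils down to two ideas: quantize most channels to INT4 in small groups (each group gets its own scale), and keep a handful of outlier channels at higher precision. The sketch below is not the authors' implementation (Atom's released kernels are low-level GPU code); it is a minimal PyTorch illustration of that general recipe, with function names, the INT8 outlier path, and all parameters chosen by me for illustration.

```python
import torch

def quantize_groups(x: torch.Tensor, n_bits: int, group_size: int):
    """Symmetric per-group quantization along the last dimension:
    each group of `group_size` values gets its own scale."""
    assert x.shape[-1] % group_size == 0
    qmax = 2 ** (n_bits - 1) - 1                      # 7 for INT4, 127 for INT8
    groups = x.reshape(*x.shape[:-1], -1, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return q, scale

def mixed_precision_quantize(x: torch.Tensor, num_outliers: int = 128,
                             group_size: int = 128):
    """Keep the highest-magnitude channels at INT8 and quantize the rest
    to INT4 in fine-grained groups. Parameter values are illustrative."""
    channel_mag = x.abs().amax(dim=0)                 # per-channel max |value|
    outliers = channel_mag.topk(num_outliers).indices
    normal = torch.ones(x.shape[-1], dtype=torch.bool)
    normal[outliers] = False
    q4, s4 = quantize_groups(x[:, normal], n_bits=4, group_size=group_size)
    q8, s8 = quantize_groups(x[:, outliers], n_bits=8, group_size=num_outliers)
    return (q4, s4), (q8, s8), outliers

# Example: 16 activation vectors with hidden size 4096. 128 outlier channels
# go to INT8; the remaining 3968 channels become INT4 in groups of 128.
x = torch.randn(16, 4096)
(q4, s4), (q8, s8), outliers = mixed_precision_quantize(x)
x_hat4 = (q4 * s4).reshape(16, -1)                    # dequantized INT4 part
print(q4.shape, q8.shape)  # torch.Size([16, 31, 128]) torch.Size([16, 1, 128])
```

Smaller groups track local value ranges more tightly (better accuracy, more scale overhead), which is the trade-off fine-grained quantization navigates.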

Community

This paper shows the thumbnail of the TeacherLM paper for me.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

The ARC-e and HellaSwag scores look weird.
Probably the author swapped them by mistake?

[attached screenshot: accuracy table from the paper]

Paper author

Hi @jiangpq, thanks for your interest in our work.
I'm one of the co-authors of this paper. For this table, we used lm-eval version 0.3.0 to obtain our results, which is the stable version available via pip. We are aware that a different version of lm-eval can produce different accuracy results. However, all results in this table were evaluated under the same environment, so we believe they still meaningfully reflect each method's capability.
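For anyone trying to reproduce the comparison, pinning the harness version matters, per the reply above. Below is a minimal sketch using the Python API of lm-eval 0.3.0; the model checkpoint is a placeholder (not one from the paper), and the keyword arguments should be double-checked against that release.

```python
# pip install lm-eval==0.3.0   (pins the harness version used for the table)
from lm_eval import evaluator

# `hf-causal` and the task names below exist in the 0.3.0 release;
# the pretrained checkpoint here is a stand-in for illustration.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=facebook/opt-125m",
    tasks=["arc_easy", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task accuracy numbers
```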


Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2310.19102 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2310.19102 in a Space README.md to link it from this page.

Collections including this paper 7