arXiv:2312.05516

Stateful Large Language Model Serving with Pensieve

Published on Dec 9, 2023
Abstract

Large Language Models (LLMs) have recently seen great success, as evidenced by the widespread popularity of ChatGPT. Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, the serving system must process a growing log of the conversation history alongside each new request, reprocessing the same history at every turn. In this paper, we design Pensieve, a system optimized for serving multi-turn conversation LLM workloads. Pensieve maintains conversation state across requests by caching previously processed history to avoid duplicate processing. Its multi-tier caching strategy uses both GPU and CPU memory to store and retrieve cached data efficiently. Pensieve also generalizes the recent PagedAttention kernel to support attention between multiple input tokens and a GPU cache spread over non-contiguous memory. Our evaluation shows that Pensieve achieves 1.51-1.95x the throughput of vLLM and reduces latency by 60-75%.
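To make the caching idea concrete, here is a minimal runnable sketch of a two-tier per-conversation KV cache, written under stated assumptions: the class and method names (MultiTierKVCache, lookup, append, _evict_until_fits) are hypothetical, plain Python lists stand in for per-token KV tensors, and ordinary dictionaries stand in for the GPU and CPU memory tiers. It illustrates the behavior the abstract describes, not Pensieve's actual implementation.

    from collections import OrderedDict

    class MultiTierKVCache:
        """Hypothetical two-tier cache: hot conversations keep their KV
        entries in a bounded "GPU" tier; least-recently-used ones are
        swapped to a "CPU" tier instead of being dropped, so a returning
        turn only has to prefill its new tokens."""

        def __init__(self, gpu_capacity_tokens):
            self.gpu = OrderedDict()   # conv_id -> KV list, in LRU order
            self.cpu = {}              # conv_id -> KV list swapped out of GPU
            self.capacity = gpu_capacity_tokens
            self.used = 0              # tokens currently held in the GPU tier

        def lookup(self, conv_id):
            """Return the cached history for a conversation, promoting it
            back from the CPU tier on a hit there; None means cold start."""
            if conv_id in self.gpu:
                self.gpu.move_to_end(conv_id)      # refresh LRU position
                return self.gpu[conv_id]
            if conv_id in self.cpu:
                kv = self.cpu.pop(conv_id)
                self._evict_until_fits(len(kv))
                self.gpu[conv_id] = kv             # swap back into the GPU tier
                self.used += len(kv)
                return kv
            return None

        def append(self, conv_id, new_kv):
            """Store KV entries computed for this turn's newly prefilled
            tokens (callers are expected to lookup() first)."""
            if conv_id in self.gpu:
                self.gpu.move_to_end(conv_id)      # protect it from eviction
            self._evict_until_fits(len(new_kv))
            self.gpu.setdefault(conv_id, []).extend(new_kv)
            self.used += len(new_kv)

        def _evict_until_fits(self, needed):
            """Spill least-recently-used conversations to the CPU tier
            until the requested number of tokens fits in the GPU tier."""
            while self.used + needed > self.capacity and self.gpu:
                victim, kv = self.gpu.popitem(last=False)
                self.cpu[victim] = kv              # keep state; do not discard
                self.used -= len(kv)

    if __name__ == "__main__":
        cache = MultiTierKVCache(gpu_capacity_tokens=1024)
        assert cache.lookup("conv-42") is None     # first turn: full prefill
        cache.append("conv-42", [("k", "v")] * 16) # stand-in KV for 16 tokens
        assert cache.lookup("conv-42") is not None # later turns reuse history

In a serving loop, lookup(conv_id) would run before prefill: on a hit, only the turn's new tokens need processing, and the resulting KV entries are stored back with append. In the real system the cached entries live in non-contiguous GPU memory blocks, which is why the generalized PagedAttention kernel mentioned in the abstract is needed to let multiple new input tokens attend over them.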
