arxiv:2406.17565

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Published on Jun 25 · Submitted by lastweek on Jun 27
Abstract

Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Experiments show that MemServe significantly improves job completion time and time-to-first-token (TTFT).
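
The global prompt tree-based locality-aware policy can be pictured as a trie over tokenized prompts whose nodes remember which serving instance holds the KV cache for that prefix; the scheduler then routes a new request to the instance with the longest cached prefix, on a best-effort basis. The Python sketch below illustrates that idea under these assumptions; the class and function names (PromptTreeNode, GlobalPromptTree, route) are hypothetical and not MemServe's actual API.

```python
# Hedged sketch: a global prompt tree recording which serving instance holds
# the KV cache for each cached token prefix, plus a best-effort router that
# prefers the instance with the longest matching prefix.
# All names here are illustrative, not MemServe's actual interfaces.

from typing import Optional


class PromptTreeNode:
    def __init__(self) -> None:
        self.children: dict[int, "PromptTreeNode"] = {}  # token id -> child node
        self.instance: Optional[str] = None              # instance caching this prefix


class GlobalPromptTree:
    def __init__(self) -> None:
        self.root = PromptTreeNode()

    def insert(self, tokens: list[int], instance: str) -> None:
        """Record that `instance` now caches the KV for this token prefix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PromptTreeNode())
            node.instance = instance

    def longest_prefix(self, tokens: list[int]) -> tuple[int, Optional[str]]:
        """Return (matched prefix length, instance holding that prefix)."""
        node, matched, instance = self.root, 0, None
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.instance is not None:
                matched, instance = i + 1, node.instance
        return matched, instance


def route(tree: GlobalPromptTree, tokens: list[int], instances: list[str]) -> str:
    """Best-effort routing: prefer the instance with the longest cached prefix."""
    matched, instance = tree.longest_prefix(tokens)
    if instance is not None and matched > 0:
        return instance
    return instances[0]  # placeholder fallback (a real scheduler would pick by load)
```

In a real scheduler the fallback branch would weigh load and memory pressure rather than defaulting to a fixed instance.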

Community


We propose Memory-enhanced model Serving, or MemServe, to handle inter-request and intra-request optimizations within a unified system. To tackle the challenges of managing the KV cache across distributed instances, MemServe introduces an elastic memory pool, or MemPool, a substrate for managing all cluster memory, including CPU DRAM and GPU HBM. MemPool offers a rich set of APIs for managing distributed memory and KV cache. Using these APIs, MemServe implements context caching over standard prefill-decode-colocated (PD-colocated) instances and over disaggregated inference. Moreover, MemServe enhances disaggregated inference with context caching, reaping both benefits. Finally, to maximize KV cache reuse, MemServe employs a global scheduler with a locality-aware policy that uses novel global prompt trees for best-effort routing; a rough sketch of such an interface follows.
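
Since MemPool is described here only as a substrate exposing APIs for distributed memory and KV cache management, the sketch below shows what such an interface might look like, assuming block-based KV storage and a simple prefix index. All names, signatures, and the linear-scan match are illustrative assumptions, not the paper's actual API.

```python
# Hedged sketch of a MemPool-like interface: an elastic pool spanning CPU DRAM
# and GPU HBM across instances, with index operations so that one instance
# (e.g., prefill) can publish KV cache that later requests or decode instances reuse.
# Names and signatures are illustrative only.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KVHandle:
    instance: str          # which serving instance owns the memory
    device: str            # "gpu_hbm" or "cpu_dram"
    block_ids: list[int]   # KV cache blocks backing this prefix


@dataclass
class MemPool:
    index: dict[tuple[int, ...], KVHandle] = field(default_factory=dict)

    def alloc(self, instance: str, device: str, num_blocks: int) -> KVHandle:
        """Reserve KV cache blocks on one instance's memory tier (placeholder allocator)."""
        return KVHandle(instance, device, list(range(num_blocks)))

    def insert(self, prompt_tokens: list[int], handle: KVHandle) -> None:
        """Publish KV cache for a prompt prefix so later requests can reuse it."""
        self.index[tuple(prompt_tokens)] = handle

    def match(self, prompt_tokens: list[int]) -> Optional[KVHandle]:
        """Find the longest cached prefix of the prompt (linear scan for brevity)."""
        best: Optional[tuple[int, KVHandle]] = None
        for prefix, handle in self.index.items():
            if len(prefix) <= len(prompt_tokens) and tuple(prompt_tokens[: len(prefix)]) == prefix:
                if best is None or len(prefix) > best[0]:
                    best = (len(prefix), handle)
        return best[1] if best else None

    def transfer(self, handle: KVHandle, dst_instance: str) -> KVHandle:
        """Move or copy KV blocks to another instance (e.g., prefill -> decode)."""
        return KVHandle(dst_instance, handle.device, handle.block_ids)
```

The point of the sketch is the division of labor: the pool owns allocation, indexing, and cross-instance transfer of KV cache, so both PD-colocated context caching and disaggregated inference can be built on the same primitives.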

Figure: MemServe overview (memserve-overview.png)

