Monday, October 27, 2025

Meet ‘kvcached’: A Machine Learning Library to Enable Virtualized, Elastic KV Cache for LLM Serving on Shared GPUs


Large language model serving often wastes GPU memory because engines pre-reserve large static KV cache regions per model, even when requests are bursty or idle. Meet ‘kvcached’, a library to enable a virtualized, elastic KV cache for LLM serving on shared GPUs. kvcached was developed by researchers from Berkeley’s Sky Computing Lab (University of California, Berkeley) in close collaboration with Rice University and UCLA, and with valuable input from collaborators and colleagues at NVIDIA, Intel Corporation, and Stanford University. It introduces an OS-style virtual memory abstraction for the KV cache that lets serving engines reserve contiguous virtual space first, then back only the active portions with physical GPU pages on demand. This decoupling raises memory utilization, reduces cold starts, and allows multiple models to time-share and space-share a device without heavy engine rewrites.

https://github.com/ovg-project/kvcached

What does kvcached change?

With kvcached, an engine creates a KV cache pool that is contiguous in the virtual address space. As tokens arrive, the library maps physical GPU pages lazily at fine granularity using CUDA virtual memory APIs. When requests complete or models go idle, pages are unmapped and returned to a shared pool, which other colocated models can immediately reuse. This preserves simple pointer arithmetic in kernels and removes the need for per-engine user-level paging. The project targets SGLang and vLLM integration, and it is released under the Apache 2.0 license. Installation and a one-command quick start are documented in the Git repository.

https://yifanqiao.notion.site/Solve-the-GPU-Cost-Crisis-with-kvcached-289da9d1f4d68034b17bf2774201b141
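To make the reserve-then-map pattern concrete, here is a minimal Python sketch. It is not the kvcached API: the names (PhysicalPagePool, ElasticKVCache), the 2 MiB page granularity, and the pool sizes are assumptions for illustration, and the real library performs the mapping with CUDA virtual memory APIs so kernels see one contiguous address range.

PAGE = 2 * 1024 * 1024  # assumed 2 MiB physical page granularity

class PhysicalPagePool:
    """Physical GPU pages shared by every colocated model on the device."""
    def __init__(self, total_pages: int) -> None:
        self.free = list(range(total_pages))

    def acquire(self) -> int:
        if not self.free:
            raise MemoryError("physical GPU pages exhausted")
        return self.free.pop()

    def release(self, page: int) -> None:
        self.free.append(page)

class ElasticKVCache:
    """Contiguous in virtual space; backed by physical pages only on demand."""
    def __init__(self, pool: PhysicalPagePool, virtual_bytes: int) -> None:
        self.pool = pool
        self.virtual_pages = virtual_bytes // PAGE  # reservation costs nothing
        self.mapping: dict[int, int] = {}           # virtual page -> physical page

    def touch(self, byte_offset: int) -> None:
        """Map the page containing byte_offset the first time it is used."""
        vpage = byte_offset // PAGE
        if vpage not in self.mapping:
            self.mapping[vpage] = self.pool.acquire()

    def trim(self) -> None:
        """Unmap everything and return pages to the shared pool (idle model)."""
        for ppage in self.mapping.values():
            self.pool.release(ppage)
        self.mapping.clear()

pool = PhysicalPagePool(total_pages=4096)          # ~8 GiB of 2 MiB pages
kv = ElasticKVCache(pool, virtual_bytes=16 << 30)  # 16 GiB virtual, 0 physical
kv.touch(0)                                        # first token maps one page
kv.trim()                                          # idle: pages instantly reusable

The key property is that the virtual reservation itself costs nothing; only touched pages consume GPU memory, and trimming makes that memory immediately available to any other model on the device.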

How does it impact at scale?

Production workloads host many models with long-tail traffic and spiky bursts. Static reservations leave memory stranded and slow down time to first token when models must be activated or swapped. The Prism research paper shows that multi-LLM serving requires cross-model memory coordination at runtime, not just compute scheduling. Prism implements on-demand mapping of physical to virtual pages and a two-stage scheduler, and reports more than 2 times cost savings and 3.3 times higher TTFT SLO attainment versus prior systems on real traces. kvcached focuses on the memory coordination primitive, and provides a reusable component that brings this capability to mainstream engines.

https://www.arxiv.org/pdf/2505.04021
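Extending the toy sketch above, the snippet below shows what cross-model memory coordination buys: both caches hold large virtual reservations against the same physical pool, so memory follows traffic instead of a static split. Again, this illustrates the mechanism only; it is not Prism's or kvcached's actual code.

kv_a = ElasticKVCache(pool, virtual_bytes=16 << 30)
kv_b = ElasticKVCache(pool, virtual_bytes=16 << 30)

for off in range(0, 2048 * PAGE, PAGE):
    kv_a.touch(off)              # burst on model A backs 2048 pages

kv_a.trim()                      # traffic shifts away; A releases its pages

for off in range(0, 2048 * PAGE, PAGE):
    kv_b.touch(off)              # B immediately reuses the same physical pages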

Performance signals

The kvcached team reports 1.2 times to 28 times faster time to first token in multi-model serving, owing to rapid reuse of freed pages and the removal of large static allocations. These numbers come from multi-LLM scenarios where activation latency and memory headroom dominate tail latency. The research team notes kvcached’s compatibility with SGLang and vLLM, and describes elastic KV allocation as the core mechanism.

https://yifanqiao.notion.site/Solve-the-GPU-Cost-Crisis-with-kvcached-289da9d1f4d68034b17bf2774201b141

Recent work has moved from fixed partitioning to virtual memory based methods for KV management. Prism extends VMM-based allocation to multi-LLM settings with cross-model coordination and scheduling. Prior efforts like vAttention explore CUDA VMM for single-model serving to avoid fragmentation without PagedAttention. The arc is clear: use virtual memory to keep the KV cache contiguous in virtual space, then map physical pages elastically as the workload evolves. kvcached operationalizes this idea as a library, which simplifies adoption within existing engines.

https://www.arxiv.org/pdf/2505.04021

Practical Applications for Devs

Colocation across models: Engines can colocate multiple small or medium models on one device. When one model goes idle, its KV pages free quickly and another model can expand its working set without a restart. This reduces head-of-line blocking during bursts and improves TTFT SLO attainment.

Activation behavior: Prism reports activation times of about 0.7 seconds for an 8B model and about 1.5 seconds for a 70B model with streaming activation. kvcached benefits from similar principles because virtual reservations allow engines to prepare address ranges in advance, then map pages as tokens arrive.

Autoscaling for serverless LLM: Fine-grained page mapping makes it feasible to scale replicas more frequently and to run cold models in a warm state with a minimal memory footprint. This enables tighter autoscaling loops and reduces the blast radius of hot spots.
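As a rough sketch of the control loop this enables, the snippet below reuses the toy ElasticKVCache from earlier; the 30 second idle threshold and the bookkeeping are hypothetical, not kvcached policy.

import time

IDLE_AFTER_S = 30.0  # assumed idle threshold, tune per workload

def autoscale_tick(replicas: dict[str, ElasticKVCache],
                   last_request_at: dict[str, float]) -> None:
    """Toy control loop: replicas idle past the threshold shed their physical
    pages but keep their virtual reservations, so reactivation only needs to
    re-map pages instead of restarting the engine."""
    now = time.monotonic()
    for name, kv in replicas.items():
        if now - last_request_at[name] > IDLE_AFTER_S and kv.mapping:
            kv.trim()  # footprint drops to ~0 while the replica stays warm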

Offloading and future work: Virtual memory opens the door to KV offload to host memory or NVMe when the access pattern permits it. NVIDIA’s recent guidance on managed memory for KV offload on GH200-class systems shows how unified address spaces can extend capacity at acceptable overheads. The kvcached maintainers also discuss offload and compaction directions in public threads. Verify throughput and latency in your own pipeline, since access locality and PCIe topology have strong effects.

https://www.arxiv.org/pdf/2505.04021
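As a rough illustration of the host-offload direction (not a kvcached feature today), the PyTorch sketch below stages a cold KV block into pinned host memory and restores it later. The block shape, function names, and synchronization policy are assumptions, and as noted above, real gains depend on access locality and PCIe topology.

import torch

def offload_block(block_gpu: torch.Tensor) -> torch.Tensor:
    """Copy a cold KV block to pinned host memory; the GPU pages backing it
    can be unmapped once the copy has completed."""
    host = torch.empty(block_gpu.shape, dtype=block_gpu.dtype,
                       device="cpu", pin_memory=True)
    host.copy_(block_gpu, non_blocking=True)  # async DMA over PCIe/NVLink
    return host

def restore_block(block_host: torch.Tensor) -> torch.Tensor:
    """Bring an offloaded block back when its request becomes active again."""
    return block_host.to("cuda", non_blocking=True)

if torch.cuda.is_available():
    # e.g. (K/V, heads, tokens, head_dim) in fp16
    kv_block = torch.randn(2, 16, 1024, 128, device="cuda", dtype=torch.float16)
    cold = offload_block(kv_block)
    torch.cuda.synchronize()  # ensure the D2H copy finished before freeing
    hot = restore_block(cold)

Pinned host memory is what makes the copies asynchronous; pageable memory would force synchronous staging and hide none of the transfer latency.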

Key Takeaways

  1. kvcached virtualizes the KV cache using GPU virtual memory: engines reserve contiguous virtual space and map physical pages on demand, enabling elastic allocation and reclamation under dynamic loads.
  2. It integrates with mainstream inference engines, specifically SGLang and vLLM, and is released under Apache 2.0, making adoption and modification straightforward for production serving stacks.
  3. Public benchmarks report 1.2 times to 28 times faster time to first token in multi-model serving, due to rapid reuse of freed KV pages and the removal of large static reservations.
  4. Prism shows that cross-model memory coordination, implemented via on-demand mapping and two-stage scheduling, delivers more than 2 times cost savings and 3.3 times higher TTFT SLO attainment on real traces; kvcached supplies the memory primitive that mainstream engines can reuse.
  5. For clusters that host many models with bursty, long-tail traffic, a virtualized KV cache enables safe colocation, faster activation, and tighter autoscaling, with reported activation around 0.7 seconds for an 8B model and 1.5 seconds for a 70B model in the Prism research.

kvcached is an effective step toward GPU memory virtualization for LLM serving, not a full operating system, and that clarity matters. The library reserves virtual address space for the KV cache, then maps physical pages on demand, which enables elastic sharing across models with minimal engine changes. This aligns with evidence that cross-model memory coordination is essential for multi-model workloads and improves SLO attainment and cost under real traces. Overall, kvcached advances GPU memory coordination for LLM serving; its production value depends on per-cluster validation.


Check out the GitHub Repo, Paper 1, Paper 2, and the technical details page.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
