
[RFC]: Generalized KV cache reuse #25950

@iddo10

Description

Motivation.

🚀 The feature, motivation and pitch

Summary
This RFC proposes enabling the reuse of the KV cache for any subset of tokens, rather than restricting reuse to prefix-complete tokens.

With the recently introduced KV cache connector abstraction (#15960) enabling retrieval from external storage, perforated KV caches naturally arise in real workloads. Supporting reuse in these cases avoids unnecessary recomputation and improves runtime efficiency.

Looking ahead, generalized KV cache reuse unlocks richer capabilities: enabling reuse of KV entries for independent token segments across contexts — for example, in retrieval-augmented generation (RAG).

This work should not be confused with sparse KV cache (#5751, #21772), where tokens are excluded from the attention computation. Here, all tokens may still participate in the computation; we simply enable reuse of even partially existing KV entries.

Background
In vLLM, efficient token generation relies heavily on KV cache reuse. The KV cache is generated and stored in local GPU memory, and KV cache that lives in GPU memory is always prefix-aligned. However, with the introduction of the KV connector, external storage no longer guarantees this: the KV cache can be perforated, meaning some entries may be missing, depending on the granularity and eviction policy of the external storage.

Currently, vLLM handles cache reuse by enforcing a strictly prefix-complete cache context: even when we have a perforated KV cache, only the longest sequential prefix is reused, and the cached entries of later tokens are ignored. This can be very inefficient and wasteful.

Example:

Current behavior (prefix-complete reuse):

Example 1: prefix-complete KV cache
Prompt:                        [A][B][C][D][E]
KV cache (local + external):   [A][B][C]
=>
Reuse:                         [A][B][C]
Compute:                                [D][E]

Example 2: non-prefix-complete KV cache (e.g., only B and D available)
Prompt:                        [A][B][C][D][E]
KV cache (local + external):      [B]   [D]
=>
Reuse:                         ✗
Compute:                       [A][B][C][D][E]

Proposed behavior (generalized reuse):
Prompt:                        [A][B][C][D][E]
KV cache (local + external):      [B]   [D]
=>
Reuse:                            [B]   [D]
Compute:                       [A]   [C]   [E]
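To make the contrast concrete, here is a minimal sketch (plain Python, not vLLM code) of how the reusable token set could be derived under both policies from a hypothetical per-token availability bitmap:

```python
# Hypothetical input: hit_bitmap[i] is True when the KV entries for prompt
# token i are available locally or in external storage.

def prefix_complete_reuse(hit_bitmap: list[bool]) -> list[int]:
    """Current behavior: reuse only the longest contiguous cached prefix."""
    reused = []
    for i, hit in enumerate(hit_bitmap):
        if not hit:
            break
        reused.append(i)
    return reused


def generalized_reuse(hit_bitmap: list[bool]) -> list[int]:
    """Proposed behavior: reuse every cached token, wherever it sits."""
    return [i for i, hit in enumerate(hit_bitmap) if hit]


# Example 2 above: prompt [A, B, C, D, E] with only B and D cached.
hits = [False, True, False, True, False]
assert prefix_complete_reuse(hits) == []   # everything is recomputed
assert generalized_reuse(hits) == [1, 3]   # B and D are reused
```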

Motivation
We claim that a non-prefix-complete cache is not just an edge case; it is actually quite common.

First, let's address the simplest case: a temporary disconnection of the link to the external storage during KV cache loading, which may cause a non-prefix-complete KV cache load. In PR #19330 the proposed solution was to redo the affected requests, reusing only the valid prefixes. Here we suggest reusing ALL successfully loaded KV cache blocks instead of recomputing them.

Arguably, these types of errors are rare and hence may not justify the effort of handling them efficiently. However, the following case shows where the proposed feature adds clear value.

Consider using external disaggregated storage, where the KV cache is distributed evenly and randomly across multiple storage nodes without high availability; see the figure below. If one node crashes, we can still utilize the rest of the cache stored on the surviving nodes, even though it is no longer prefix-complete. This relieves the storage solution from implementing complex high-availability mechanisms.

[Figure: KV cache distributed evenly and randomly across multiple storage nodes; a single node failure leaves the surviving cache non-prefix-complete]

Broader Impact
Beyond the scenarios mentioned above, supporting generalized KV cache reuse lays the foundation for richer capabilities - reuse of independent cache segments.

We will refer to this phenomenon as “compositional context”: each segment’s KV cache is independent of other contexts but can still be used within a larger prompt that contains the segment.

Note that for compositional context, a special token will have to be generated so that a token’s key can be derived from its internal context alone, whereas in the general case a key depends on the entire preceding context. This is a task the KV connector will have to take on.

For example, say we have 4 token segments as a prefix to a given new prompt, where each segment has an independently computed KV cache (self-attention within the segment):
[Figure: four token segments (chunks 1–4), each with an independently computed KV cache, preceding the new prompt]
Ordinarily, this entire stream (chunks + prompt) will be fully recomputed (aside from chunk 1), as follows (below we show the attention computation between chunk 4 and chunk 2; the same applies to all chunk pairs):
[Figure: full recomputation of the stream; cross-attention shown between chunk 4 and chunk 2]
However, if a chunk’s inner attention holds most of the attention weight, we can reuse it without quality degradation, thereby breaking the sequentiality requirement imposed by vLLM. This can be done per chunk, or even per token (which is the premise of CacheBlend, https://arxiv.org/abs/2405.16444).

By treating tokens independently when a full sequential context cannot be guaranteed, we can:

  1. Reduce recomputation for non-contiguous attention patterns, replacing it with KV cache loading.
  2. Improve efficiency for retrieval-augmented generation (RAG), where inserted chunks break the strict token order but large segments remain reusable.
  3. Improve resilience to IO errors or packet-loss scenarios in real-time inference, where missing tokens no longer force full recomputation.

Insight
Note that when we want to recompute a token within the prefix because its KV vectors are missing, the work is actually not along the cache (key/value) dimension but along the query dimension.
Meaning, after reconstructing the KV cache (either via qkv_proj(), which is part of the forward flow, or from the cache), we calculate attention only for the queries that could not locate their KV vectors in the cache and therefore require recalculation of the layer output. We name the queries that require recalculation query-holes.

Also, this feature can work either per token (meaning we recompute all layers if needed) or per layer, per token (meaning we can stop the computation once the KV cache exists in all layers above some layer L).
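The following sketch illustrates the query-hole selection (names and tensor layouts are illustrative, not existing vLLM internals), for both the per-token and the per-layer, per-token variants:

```python
import torch


def query_holes(hit_bitmap: torch.Tensor) -> torch.Tensor:
    """Per-token mode: indices of tokens whose layer outputs must be recomputed."""
    return torch.nonzero(~hit_bitmap, as_tuple=False).squeeze(-1)


def query_holes_per_layer(hit_matrix: torch.Tensor) -> list[torch.Tensor]:
    """Per-layer, per-token mode: hit_matrix[l, i] says whether layer l already
    holds the KV entry of token i; the computation can stop once every layer
    above some layer L already holds all of its KV entries."""
    return [query_holes(hit_matrix[l]) for l in range(hit_matrix.shape[0])]


# Six prompt tokens; tokens 3, 5 and 6 (1-indexed) are missing their KV entries.
bitmap = torch.tensor([True, True, False, True, False, False])
print(query_holes(bitmap))   # tensor([2, 4, 5])
```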

Proposal
To support the feature, we need to modify the attention kernels, the KV connector and the scheduler. This change can be made opt-in via a runtime config flag (e.g., --enable-perforated-cache-handling) for a controlled rollout. Note that the FLASH_INFER backend doesn’t need any change, as it already supports masking.

Attention Kernel
To allow the attention kernel to control which query q is multiplied by which keys k, a masking parameter is required (this is already supported in the FLASH_INFER backend).

By default, attention kernels receive a batch of queries Q, assume they are sequential, and compute attention against all previous keys K. With masking, we can still send the full set of queries Q, but constrain each query so it only attends to a specific subset of keys K'.

For example, say we have 6 tokens, only some of which require recompute:

[Figure: the six tokens; green tokens already have their KV cache stored, red tokens require recompute]

Subsequently, the scheduler will pass tokens 2, 4 and 5 to the attention flow, where they will be referred to as tokens 4', 5' and 6', since the kernel assumes tokens 1, 2 and 3 are cached. Hence, to keep the attention flow coherent, the scheduler will also pass a masking array per token, as follows:

[Figure: the kernel's view of the input: the renumbered queries and their per-token masks]

The masking basically tells the kernel to multiply a specific query only by some of the keys; for instance, token 4', which is in fact token 2, should be multiplied only by the keys of tokens 1 and 2 (see the sketch after the list below).

This enables two things:

  1. Recomputation for non-sequential queries (queries that don’t follow a strict order).
  2. Context-length restrictions, e.g., ensuring that query 100 only attends to keys 80–99 rather than the entire key set.
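Below is a minimal PyTorch sketch of the masking idea (it is not a real vLLM backend kernel and glosses over paging and batching): only the query-hole tokens 2, 4 and 5 from the example are passed as queries, and a boolean mask restricts each of them to the keys it would have seen under ordinary causal attention.

```python
import torch
import torch.nn.functional as F

num_heads, head_dim, seq_len = 2, 8, 6
hole_idx = torch.tensor([1, 3, 4])   # tokens 2, 4, 5 of the example (0-indexed)

# Full K/V for all 6 tokens: cached entries are loaded, missing ones are
# rebuilt via qkv_proj() earlier in the forward pass (assumed available here).
k = torch.randn(num_heads, seq_len, head_dim)
v = torch.randn(num_heads, seq_len, head_dim)
q = torch.randn(num_heads, hole_idx.numel(), head_dim)   # hole queries only

# Per-query mask: each hole query may attend to every key up to and including
# its own original position (True = participate in attention).
key_pos = torch.arange(seq_len)
attn_mask = key_pos[None, :] <= hole_idx[:, None]        # [num_holes, seq_len]

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask[None, :, :])
print(out.shape)   # torch.Size([2, 3, 8])
```

In a real backend, the dense mask above would be replaced by the backend's native masking mechanism (e.g., FLASH_INFER's custom-mask support) rather than being materialized per query.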

KV connector
At present, the KV connector receives a number representing the prefix length already stored in GPU memory. It then attempts to fetch additional KV data from external storage and returns the count of sequential tokens it was able to retrieve (get_num_new_matched_tokens).

To support the proposed change, the KV connector should instead return a success bitmap indicating, for each KV vector, whether it was found (1) or not (0), regardless of the reason. This bitmap will then be passed to the scheduler for handling.

Optionally, the design could be extended to a matrix representation, providing the same indication at a finer granularity (per token × per layer).
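A sketch of what such an extension could look like (class and method names here are hypothetical, not the current KVConnector API):

```python
import torch


class PerforatedKVConnector:
    """Hypothetical connector that reports per-token hits instead of a prefix count."""

    def get_matched_token_bitmap(self, request_id: str, num_prompt_tokens: int,
                                 num_computed_tokens: int) -> torch.Tensor:
        """Return a bool tensor of shape [num_prompt_tokens]:
        True  -> KV entries for this token were found (GPU or external storage),
        False -> missing for any reason (eviction, IO error, node failure)."""
        bitmap = torch.zeros(num_prompt_tokens, dtype=torch.bool)
        bitmap[:num_computed_tokens] = True                 # already on the GPU
        bitmap |= self._lookup_external(request_id, num_prompt_tokens)
        return bitmap

    def _lookup_external(self, request_id: str, n: int) -> torch.Tensor:
        # Placeholder for the storage-specific lookup; the per-layer extension
        # would return a [num_layers, n] matrix instead.
        return torch.zeros(n, dtype=torch.bool)
```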

Scheduler
The scheduler is in charge of informing the KV connector what to fetch from the external storage, and of running the inference flow with the relevant information.

The required changes are as follows:

  1. Modify the glue code to use the modified connector API.
  2. Send only the relevant hidden vectors to the model’s forward function.
  3. Communicate the appropriate masking array to the model’s forward function, i.e., which key and value vectors should be multiplied by each query (see the sketch below).
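A sketch of the scheduler-side glue (illustrative names only, not current vLLM internals): given the connector's hit bitmap, keep only the query-hole hidden states for the forward pass and build the per-query key mask handed to the attention backend.

```python
import torch


def build_forward_inputs(hidden_states: torch.Tensor, hit_bitmap: torch.Tensor):
    """hidden_states: [seq_len, hidden_size]; hit_bitmap: [seq_len] bool."""
    hole_idx = torch.nonzero(~hit_bitmap, as_tuple=False).squeeze(-1)
    query_hidden = hidden_states[hole_idx]               # only the relevant vectors
    key_pos = torch.arange(hit_bitmap.numel())
    attn_mask = key_pos[None, :] <= hole_idx[:, None]    # per-query key mask
    return query_hidden, hole_idx, attn_mask
```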

Impact
  • Improves the flexibility of the scheduling and runtime execution model.

  • Minimal changes required to the existing APIs.

  • May complicate some backend logic (e.g., attention kernels), depending on hardware specifics; we would want the kernels to be able to work with masking maps across the KV cache.

  • Management overhead may exceed the KV cache reuse savings if the number of hole tokens is high (in which case we would prefer to recompute the entire prompt). The trade-off depends on storage versus GPU characteristics, and a threshold must be configured (see the sketch below).
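A possible shape for that threshold check (the function name and default value are placeholders; the right value depends on storage bandwidth versus GPU compute):

```python
def should_use_generalized_reuse(hit_bitmap: list[bool],
                                 max_hole_fraction: float = 0.5) -> bool:
    """Fall back to full recomputation when too many tokens are holes and the
    masking/management overhead would outweigh the loading savings."""
    holes = sum(1 for hit in hit_bitmap if not hit)
    return holes / max(len(hit_bitmap), 1) <= max_hole_fraction
```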

Implementation Plan
  1. Add detection logic for tokens whose KV cache failed to load - this utilizes the suggestion defined in #19329, in which the KVConnector returns the failed blocks.
  2. Prototype a fallback scheduler mode for independent token execution.
  3. Utilize the FLASH_INFER API for a first-generation implementation.
  4. Expand to all backends.
  5. Enable compositional context - add special tokens for key generation of independent segments.
  6. Modify the attention kernel interface to accept per-token attention lengths/masks.

Related Discussions

  1. [RFC]: Graceful Error Handling for KV Connector Load Failures (#19329)
  2. Discussion on the KV connector API: KVConnector API Evolution
  3. Potential relevance to speculative decoding and prefix caching strategies

Alternatives

No response

Additional context

No response

Proposed Change.

We propose a change so that the attention kernels can handle non-sequential token queries and sparse KV vectors per token.

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response
