proposal: support context cache for improved conversation efficiency #1300

Draft: zhengkezhou1 wants to merge 3 commits into vllm-project:main from zhengkezhou1:ep-1248
# EP: Support Context Cache for Improved Conversation Efficiency

## Background

Many LLM providers (OpenAI, Anthropic) offer prompt caching, which reduces time-to-first-token latency by caching prompts. We want to introduce similar functionality, Context Cache, into the current system.

Context Cache will be integrated with the existing KV Cache. From a high-level perspective, it can be viewed as a KV Cache manager that operates during the prefill stage. When Context Cache is enabled and the prompt received by the inference engine already exists in the current KV Cache, we can skip the computation phase and instead load the cached KV blocks from the cache (disk or remote storage) into GPU memory.

## Goal

Provide prompt-caching-like functionality for the current system. The feature is optional for users, and the usage is as follows (a client-side sketch follows the list):

1. Users pass a TTL (time-to-live) and the initial prompt to a dedicated endpoint to initialize the cache, and receive the corresponding response along with a unique identifier for the current session (session-id).
2. Users send OpenAI-compatible requests (chat completions, responses) with the unique identifier (session-id). When a cache hit occurs, the response time is shorter than without the cache.
3. The cache is automatically deleted after the user-specified TTL, or users can proactively send a request to the endpoint to delete it early.
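
To make the flow concrete, here is a minimal client-side sketch of these three steps in Go. It only illustrates the intended usage: the `/v1/context` and `/v1/completions` endpoints and the `x-session-id`/`x-session-ttl` headers are the ones proposed later in this document, and `gatewayURL` is a placeholder.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

const gatewayURL = "http://localhost:8000" // placeholder gateway address

func main() {
	body := []byte(`{"model": "facebook-opt-125m", "prompt": "Say this is a test"}`)

	// 1. Create a context cache session with a one-hour TTL.
	req, _ := http.NewRequest("POST", gatewayURL+"/v1/context", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-session-ttl", "3600") // Authorization header omitted for brevity
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	sessionID := resp.Header.Get("x-session-id") // returned by the gateway
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	// 2. Reuse the session on an ordinary OpenAI-compatible request.
	req, _ = http.NewRequest("POST", gatewayURL+"/v1/completions", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-session-id", sessionID)
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	// 3. Delete the cache before the TTL expires.
	req, _ = http.NewRequest("DELETE", gatewayURL+"/v1/context/"+sessionID, nil)
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("cleared session", sessionID)
}
```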

## Implementation

### Request Flow

We will introduce a new endpoint, `/v1/context`, to manage context caches. The following fields will be used:

- `x-session-id`: A unique identifier for each context cache, created upon the first request and used in subsequent requests.
- `x-session-ttl`: The time-to-live of the cache, after which it is automatically cleared.

Placing these two fields in the HTTP headers keeps all requests compatible with the OpenAI API.
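
As an illustration only, a minimal sketch of how the gateway side might read these two headers. It assumes a plain `net/http` handler rather than the actual gateway plugin API, and the one-hour default TTL is purely illustrative:

```go
package gateway

import (
	"net/http"
	"strconv"
	"time"
)

// sessionHeaders extracts the two context-cache headers described above.
// Both are optional: a missing x-session-id means "no existing session",
// and a missing or invalid x-session-ttl falls back to a default.
func sessionHeaders(r *http.Request) (sessionID string, ttl time.Duration) {
	sessionID = r.Header.Get("x-session-id")

	ttl = time.Hour // illustrative default TTL, not part of the proposal
	if raw := r.Header.Get("x-session-ttl"); raw != "" {
		if secs, err := strconv.Atoi(raw); err == nil && secs > 0 {
			ttl = time.Duration(secs) * time.Second
		}
	}
	return sessionID, ttl
}
```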

#### Creating a Cache for a Session

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant R as Context Cache Manager
    participant T as Router
    participant V as vLLM Engine
    participant S as Persistent KV Cache Store

    Note over C,S: Creating a new Context Cache Session

    C->>+E: 1. POST /v1/context<br/>(x-session-ttl, prompt, model)
    E->>+G: 2. Forward Request
    G->>+R: 3. Generate x-session-id
    R->>+T: 4. Make routing decision (based on routing algorithm)
    T->>+V: 5. Inference request

    Note over V: Execute inference (Prefill + Decode)
    V->>V: 6. Execute inference
    V->>S: 7. Offload cache to KV Cache storage
    V-->>-T: 8. Return output tokens
    T->>-R: 9. Create mapping between session-id and prompt cache<br/>e.g.: session-01 -> input tokens + output tokens
    R-->>-G: 10. Return response
    G-->>-E: 11. Pipe back response<br/>(with x-session-id)
    E-->>-C: 12. Complete Response
```

Before using context caching, users first need to create it. Here, we create a context cache with an `x-session-ttl` of one hour:

```shell
curl -X POST http://localhost:8000/v1/context \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890" \
  -H "x-session-ttl: 3600" \
  -d '{
    "model": "facebook-opt-125m",
    "prompt": "Say this is a test"
  }'
```

In the response, we obtain the unique identifier for the created session, `x-session-id`, from the HTTP headers:

```
x-session-id: "session-01"
```

```json
{
  "id": "cmpl-de1f99972bd34149968489cb100b2c88",
  "object": "text_completion",
  "created": 1752594611,
  "model": "facebook-opt-125m",
  ...
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 93,
    "completion_tokens": 87,
    "prompt_tokens_details": null
  }
}
```

#### Using Context Cache with `x-session-id`

```mermaid
sequenceDiagram
    participant C as Client
    participant E as Envoy
    participant G as Gateway Plugin
    participant R as Context Cache Manager
    participant T as Router
    participant V as vLLM Engine
    participant S as Persistent KV Cache Store

    Note over C,S: Using an existing session-id for an inference request

    C->>+E: 1. POST /v1/completions<br/>(x-session-id, prompt, model...)
    E->>+G: 2. Forward Request
    G->>+R: 3. Look up cache corresponding to x-session-id
    R->>+T: 4. Make routing decision (based on routing algorithm)

    alt Cache hit
        Note over R,S: ✅ Cache hit path: Use existing cache
        T->>R: 5a. Return inference engine pod metadata
        R->>+S: 6a. Load prompt cache from the persistent store<br/>to the inference engine (pod)
        S-->>-V: 7a. Return prompt cache
        T->>+V: 8a. Inference request<br/>(inference engine will use the prompt cache during prefill)
    else Cache miss
        Note over T,V: ⚠️ Cache miss path: Execute full inference
        T->>+V: 5b. Inference request<br/>(inference engine will use the entire prompt during prefill)
        Note over V,S: Generate new cache for subsequent use
        V->>S: 6b. Offload newly generated cache to KV Cache storage
    end

    Note over V: 🔄 Execute inference (Prefill + Decode)
    V-->>-T: 7. Return output tokens
    T->>-R: 8. Update mapping between session-id and prompt cache
    R-->>-G: 9. Return response
    G-->>-E: 10. Pipe back response<br/>(with x-session-id)
    E-->>-C: 11. Complete Response
```

With an existing session, clients send a normal OpenAI-compatible request and only add the `x-session-id` header:

```shell
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890" \
  -H "x-session-id: session-01" \
  -d '{
    "model": "facebook-opt-125m",
    "prompt": "Say this is a test"
  }'
```
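
The alt block above can be sketched as Go pseudocode for the Context Cache Manager. The `Router` and `KVCacheStore` interfaces below are placeholders invented for this sketch, not existing APIs:

```go
package ccm

import "context"

// Hypothetical interfaces for the components in the sequence diagram above;
// they are placeholders for this proposal, not existing AIBrix APIs.
type Router interface {
	// Pick returns metadata (here just a pod name) for the vLLM engine pod
	// chosen by the routing algorithm.
	Pick(ctx context.Context, model string) (pod string, err error)
}

type KVCacheStore interface {
	// Load moves the persisted prompt cache for a session onto the given pod.
	Load(ctx context.Context, sessionID, pod string) error
}

// ContextCacheManager sketches the gateway plugin described in this proposal.
type ContextCacheManager struct {
	router   Router
	store    KVCacheStore
	sessions map[string]bool // session-id -> a persisted cache exists (simplified)
}

// HandleCompletion mirrors the alt block of the diagram: route first, then
// either load the persisted cache (hit) or fall through to a full prefill
// (miss). In both cases the request itself is forwarded unchanged.
func (m *ContextCacheManager) HandleCompletion(ctx context.Context, sessionID, model string) (pod string, cacheHit bool, err error) {
	pod, err = m.router.Pick(ctx, model)
	if err != nil {
		return "", false, err
	}
	if m.sessions[sessionID] {
		// Cache hit path (5a-8a): load pre-computed KV blocks into the
		// selected pod so the engine can skip recomputation during prefill.
		if loadErr := m.store.Load(ctx, sessionID, pod); loadErr == nil {
			return pod, true, nil
		}
		// If loading fails, degrade gracefully to a full prefill.
	}
	// Cache miss path (5b-6b): the engine runs a full prefill and offloads
	// the newly generated cache afterwards.
	return pod, false, nil
}
```

Routing happens before the cache lookup result is applied, matching step 4 in the diagram, so a cache miss simply degrades to the normal request path.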

#### Clearing Context Cache

When the TTL expires, the cache is cleared automatically. Users can also clear it early with an explicit request:

```shell
curl -X DELETE http://localhost:8000/v1/context/$session_id \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key-1234567890"
```
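
TTL-based expiry could be handled inside the Context Cache Manager with a small registry; a minimal sketch under that assumption (all names are illustrative):

```go
package ccm

import (
	"sync"
	"time"
)

// sessionRegistry is an illustrative in-memory registry that supports both
// automatic TTL expiry and explicit deletion via DELETE /v1/context/{id}.
type sessionRegistry struct {
	mu     sync.Mutex
	expiry map[string]time.Time
}

func newSessionRegistry() *sessionRegistry {
	return &sessionRegistry{expiry: map[string]time.Time{}}
}

// Add registers a session with the TTL taken from the x-session-ttl header.
func (r *sessionRegistry) Add(sessionID string, ttl time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expiry[sessionID] = time.Now().Add(ttl)
}

// Delete implements manual early clearing.
func (r *sessionRegistry) Delete(sessionID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.expiry, sessionID)
}

// Sweep removes expired sessions and returns their IDs so the caller can
// also evict the corresponding entries from the persistent KV cache store.
// It would be run periodically, e.g. from a background goroutine.
func (r *sessionRegistry) Sweep(now time.Time) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var expired []string
	for id, deadline := range r.expiry {
		if now.After(deadline) {
			expired = append(expired, id)
			delete(r.expiry, id)
		}
	}
	return expired
}
```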

### Data Plane Change

#### Introduce new plugin: Context Cache Manager

We need to add a new plugin, the Context Cache Manager, at the gateway layer to manage prompt caches across sessions. In a traditional single-session inference, the KV Cache holds attention states computed during prefill, applies them during the decode phase, and evicts them from memory after inference completes. The Context Cache Manager removes this limitation: frequently reused prompt segments (such as system messages or document context) are pre-computed as a prompt cache and stored persistently for the specified TTL. On a cache hit, the prefill phase loads the pre-computed attention states from storage instead of recomputing them, turning a compute-intensive step into a data-intensive one and reducing time-to-first-token latency.

#### Interaction between Context Cache Manager and existing components

The Context Cache Manager (CCM) serves as the core coordination component and primarily interacts with the following components:

1. **Bidirectional interaction with the Router**:
   - CCM sends routing decision requests to the Router
   - The Router selects the target inference engine based on the routing algorithm and returns pod metadata to CCM
   - When a cache hit occurs, CCM loads the prompt cache into the GPU memory of the pod selected by the Router

2. **Interaction with the Persistent KV Cache Store**:
   - CCM manages the cache lifecycle
   - CCM coordinates loading and offloading of caches between persistent storage and GPU memory
   - CCM maintains the mapping between session-id and cache data (see the sketch below)
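
The session-id to cache mapping mentioned above could be a simple record per session; a sketch assuming the persisted cache is addressed by block keys in the KV cache store (all field names are hypothetical):

```go
package ccm

import "time"

// sessionCache is a hypothetical record the Context Cache Manager could keep
// per x-session-id, tying the session to its persisted KV cache.
type sessionCache struct {
	SessionID string    // value of the x-session-id header
	Model     string    // model the cache was generated for
	Tokens    []int     // input + output token IDs covered by the cache
	BlockKeys []string  // keys of the persisted KV blocks in the cache store
	ExpiresAt time.Time // derived from x-session-ttl
}

// cacheIndex maps a session ID to its cached context,
// e.g. "session-01" -> input tokens + output tokens.
type cacheIndex map[string]sessionCache
```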