proposal: support context cache for improved conversation efficiency #1300
Conversation
Signed-off-by: zhengkezhou1 <[email protected]>
Summary of Changes
Hello @zhengkezhou1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new context caching feature to the Aibrix system, designed to significantly improve the efficiency and reduce computational costs of multi-turn LLM conversations by reusing Key-Value (KV) Caches. It outlines a new API for managing these caches and details the architecture of a new Context Cache Manager component responsible for metadata and lifecycle management.
Highlights
- Context Caching Feature: Introduces an optional context caching mechanism for LLM inference in multi-turn conversations to reuse Key-Value (KV) Caches and reduce redundant computation, aiming to improve efficiency and reduce costs.
- New API Endpoint for Cache Management: Proposes a new `/v1/context` API endpoint for creating and deleting context caches. Each cache is identified by a unique `session_id` and managed with a `ttl` (time-to-live).
- KV Cache Reuse in Completions: Details how existing `/v1/completions` requests can leverage a `session_id` to load and utilize previously computed KV Cache data from the KV Cache Sidecar, significantly reducing `prompt_tokens` in subsequent turns.
- New Context Cache Manager Component: Defines a new Context Cache Manager component, intended to run within the Runtime Container. This manager is responsible for handling session metadata (session_id, TTL, KV Cache Sidecar references) and their lifecycle, but explicitly not the physical KV Cache data itself.
- Architectural Request Flows: Provides detailed sequence diagrams illustrating the end-to-end request flows for both creating a new context cache and subsequently using an existing one, outlining interactions between the Client, Envoy, Gateway Plugin, Context Cache Manager, InferencePod, vLLM Main Container, and KV Cache Sidecar.
- Proposed Data Models and Interfaces: Includes Python `pydantic` models and a `ContextCacheManager` class structure, outlining the proposed API request/response formats and the internal interfaces for metadata management within the Context Cache Manager (a rough sketch follows below).
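The proposal's actual pydantic models are not reproduced in this summary. As a rough illustration of what the `/v1/context` request/response models could look like, here is a minimal sketch; every field name other than `session_id` and `ttl` is an assumption, not taken from the proposal.

```python
# Illustrative sketch only: the proposal defines pydantic models for /v1/context,
# but every field here except session_id and ttl is an assumption.
from typing import Optional

from pydantic import BaseModel


class CreateContextCacheRequest(BaseModel):
    """Body of POST /v1/context (hypothetical field names)."""
    model: str                 # model whose KV cache will be pre-computed (assumed)
    prompt: str                # reusable prompt segment, e.g. a system message (assumed)
    ttl: int = 3600            # time-to-live for the cached context, in seconds


class CreateContextCacheResponse(BaseModel):
    """Response of POST /v1/context (hypothetical field names)."""
    session_id: str            # identifier passed as x-session-id on later /v1/completions calls
    ttl: int                   # TTL actually applied by the Context Cache Manager
    created_at: Optional[int] = None  # unix timestamp of creation (assumed)
```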
Code Review
The design document proposes a context caching feature to improve conversation efficiency by reusing KV caches. The separation of concerns between the Context Cache Manager and the KV Cache Sidecar is a good design choice. The review focuses on ensuring the KV cache is updated after each turn, clarifying request flows, and refining lifecycle management for robustness and predictability.
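To make the separation of concerns the review refers to concrete, here is a minimal sketch of metadata-only lifecycle management, assuming an in-memory store; all class and method names are hypothetical and the proposal's actual `ContextCacheManager` interface lives in the design document, not here.

```python
# Minimal sketch of metadata-only lifecycle management, assuming an in-memory store.
# Names are hypothetical; the physical KV cache data stays in the KV Cache Sidecar.
import time
import uuid
from dataclasses import dataclass


@dataclass
class SessionMetadata:
    session_id: str
    ttl: int                 # seconds
    created_at: float
    sidecar_ref: str         # reference to the KV Cache Sidecar entry, not the KV data itself

    def expired(self, now: float) -> bool:
        return now - self.created_at > self.ttl


class ContextCacheManager:
    """Tracks session metadata and TTLs; never touches the KV tensors themselves."""

    def __init__(self) -> None:
        self._sessions: dict[str, SessionMetadata] = {}

    def create(self, ttl: int, sidecar_ref: str) -> SessionMetadata:
        meta = SessionMetadata(str(uuid.uuid4()), ttl, time.time(), sidecar_ref)
        self._sessions[meta.session_id] = meta
        return meta

    def get(self, session_id: str) -> SessionMetadata | None:
        meta = self._sessions.get(session_id)
        if meta and meta.expired(time.time()):
            # Expired metadata is dropped here; evicting the sidecar data is a separate step.
            del self._sessions[session_id]
            return None
        return meta

    def delete(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)
```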
Hi @Jeffwan, I've submitted this draft PR for the context caching feature proposal. As this is a preliminary design for a new feature, I'm looking forward to feedback on the overall architectural design, the API definitions, and the integration approach with the existing KV Cache Sidecar. If you have a moment, please take a look. Thank you!
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Zhengke Zhou <[email protected]>
@zhengkezhou1 thanks for driving this effort. @happyandslow, did you get a chance to review the proposal? I'm not sure whether #633 covers similar features.
@zhengkezhou1 BTW, I see this PR is still in progress. Is it ready for review? If the PR's scope is just the proposal, please change the PR title to avoid confusion.
Yes, this is a proposal. I've already changed the title, and I'd appreciate any feedback to ensure I'm on the right track.
@zhengkezhou1 please take a look at my initial comments.
@zhengkezhou1 I am busy with the v0.4.0 release; I will check your reply later today.
Sorry for the late response. Could you check such usage? If we need to make some changes on the engine side, that's okay.
Following https://aibrix.readthedocs.io/latest/development/development.html#development, I tested on macOS: the first inference takes 03:25; sending the same request again takes 0:00:34.
I'm looking for more prompt caching info, so I'll update this proposal soon, maybe later this week.
Signed-off-by: zhengkezhou1 <[email protected]>
Hi @Jeffwan, I've updated the proposal based on the paper at https://arxiv.org/abs/2311.04934 and other prompt caching resources. I'm a bit unclear on how the inference engine loads data from the AIBrix KVCache Offloading Framework.
@zhengkezhou1 thanks for the update. I will check it out tomorrow. BTW, do you need to integrate with the AIBrix KVCache offloading framework? We can ask the maintainer to give more context if needed.
Yes, I believe so. In the scenario described at https://aibrix.readthedocs.io/latest/designs/aibrix-kvcache-offloading-framework.html#l2-distributed-kvcache-and-cross-engine-kv-reuse, if prompt caching is in remote storage rather than GPU memory, it needs to be loaded into the pod performing the prefill. Perhaps vLLM offers a ready-made method for this? I'm unsure about that point.
My understanding is that this needs to be supported by the prompt template. If we need to support this, we need another PR.
Pull Request Description
Problem: The current system lacks prompt caching capability, leading to repeated computation for frequently used prompts and higher time-to-first-token latency.
Solution: Introduce a Context Cache that pre-computes and persists attention states for reusable prompt segments (system messages, document context) beyond a single inference session.
Key Changes:
- `/v1/context` endpoint with session management (`x-session-id`, `x-session-ttl`)

Benefits:
- Avoids repeated prefill computation for frequently used prompts and reduces time-to-first-token latency
API Usage (illustrated in the client sketch below):
- `POST /v1/context` - Initialize cache with TTL
- `POST /v1/completions` with `x-session-id` - Use cached context
- `DELETE /v1/context/{session_id}` - Manual cleanup
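As a concrete illustration of this call sequence: the paths and the `x-session-id` / `x-session-ttl` headers come from the proposal, while the gateway address, request payload fields, and response shape below are assumptions.

```python
# Hypothetical client walk-through of the /v1/context lifecycle.
# Only the endpoint paths and the x-session-id / x-session-ttl headers come from
# the proposal; the host, payload fields, and response shape are assumptions.
import requests

GATEWAY = "http://localhost:8888"  # assumed gateway address

# 1. Initialize a context cache with a TTL.
resp = requests.post(
    f"{GATEWAY}/v1/context",
    headers={"x-session-ttl": "3600"},
    json={"model": "llama-3-8b", "prompt": "You are a helpful assistant. <long document>..."},
)
session_id = resp.json()["session_id"]  # assumed response field

# 2. Reuse the cached context on subsequent completions.
completion = requests.post(
    f"{GATEWAY}/v1/completions",
    headers={"x-session-id": session_id},
    json={"model": "llama-3-8b", "prompt": "Summarize the document in one paragraph."},
)
print(completion.json())

# 3. Clean up manually before the TTL expires, if desired.
requests.delete(f"{GATEWAY}/v1/context/{session_id}")
```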
Flow diagram
Creating a Cache for a Session
Using Context Cache with `x-session-id`
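The sequence diagram itself is not reproduced in this text. As a rough, hedged sketch of the cache-hit path it describes: the actors (Gateway Plugin, Context Cache Manager, inference pod, KV Cache Sidecar) come from the proposal, while every function and parameter name below is hypothetical.

```python
# Rough sketch of the cache-hit path described in the proposal's sequence diagram.
# Every function and type name here is hypothetical; only the actors and the
# x-session-id header come from the proposal text.

def handle_completion(request_headers: dict, body: dict, cache_manager, inference_pod):
    session_id = request_headers.get("x-session-id")
    meta = cache_manager.get(session_id) if session_id else None

    if meta is None:
        # Cache miss or expired session: fall back to a normal prefill.
        return inference_pod.complete(body)

    # Cache hit: ask the pod to load the previously computed KV blocks from the
    # sidecar before decoding, so the cached prefix's prompt tokens are not recomputed.
    inference_pod.load_kv_from_sidecar(meta.sidecar_ref)
    return inference_pod.complete(body, reuse_prefix=True)
```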
Related Issues
Resolves: #1248