Motivation.
🧩 Summary
This RFC proposes adding fault recovery support to the KV connector infrastructure introduced in #15960. The goal is to make vLLM resilient to KV block load failures (e.g., due to eviction or network disconnections) by detecting failures and rescheduling only affected requests for recomputation.
🧠 Background
vLLM recently introduced the KV connector abstraction, providing a pluggable interface for integrating external solutions for KV cache offload and transfer. This enables supporting use cases like KV cache sharing and prefill-decode (P-D) disaggregation.
Notable examples include the LMCache connector (PR #16625) and NIXL connector (PR #17751).
The KV loading protocol involves:
- `start_load_kv()` → invoked before the forward pass to initiate (asynchronous) loading of all required KV blocks.
- `wait_for_layer_load()` → invoked before each layer's attention kernel to ensure the relevant blocks are available.
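The two-step protocol above can be sketched as follows. This is a hypothetical illustration of the call pattern (class name, constructor, and the in-memory "store" are invented for the example; the real `KVConnectorBase_V1` signatures may differ):

```python
from concurrent.futures import ThreadPoolExecutor

class KVConnectorSketch:
    """Hypothetical sketch of the KV loading protocol.
    Method names mirror the RFC text; actual vLLM signatures may differ."""

    def __init__(self, store):
        self._store = store  # block_id -> KV payload (stand-in for remote storage)
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._pending = {}

    def start_load_kv(self, block_ids):
        # Before the forward pass: kick off asynchronous loads for all blocks.
        self._pending = {bid: self._pool.submit(self._store.get, bid)
                         for bid in block_ids}

    def wait_for_layer_load(self, layer_block_ids):
        # Before each layer's attention kernel: block until its KV is resident.
        return [self._pending[bid].result() for bid in layer_block_ids]
```

The key property is that loading overlaps with compute: all transfers start before the forward pass, and each layer only synchronizes on the blocks it actually needs.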
⚠️ Problem
Currently, vLLM lacks a mechanism to detect or recover from KV load failures. If one or more KV blocks fail to load, it may lead to system instability (crash or hang) or incorrect outputs (silent corruption of the KV cache). This undermines reliability, especially in large-scale or disaggregated deployments.
Proposed Change.
Even if KV load failures occur, the forward pass should complete. Post-execution, vLLM will:
- Detect failed KV blocks, and
- Reschedule only the affected requests for recomputation, using the longest valid prefix.
💡 Key Considerations:
- Preserve progress for valid blocks in affected requests and for unaffected requests in the same batch.
- Prevent deadlocks in tensor-parallel setups (e.g., all-reduce ops).
- In some connector implementations, attention kernels may have already been launched and cannot be "rolled back."
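The "longest valid prefix" truncation described above can be sketched with a small helper (a hypothetical function, not vLLM's actual code): since later blocks depend on earlier ones, a request's computed-block count is cut back to the run of blocks before the first load failure.

```python
def longest_valid_prefix(block_ids, failed_ids):
    """Hypothetical helper: count the leading blocks that loaded cleanly.

    A request's computed-block count is truncated to this prefix, so
    everything from the first failed block onward is recomputed."""
    failed = set(failed_ids)
    n = 0
    for bid in block_ids:  # blocks in sequence order
        if bid in failed:
            break  # this block and everything after it must be recomputed
        n += 1
    return n
```

For example, a request with blocks `[7, 8, 9, 10]` and a load failure on block 9 keeps its first two blocks and recomputes from block 9 onward.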
🛠️ Suggested Implementation
1. KV Connector API Extension
   - Add `get_block_ids_with_load_errors()` to `KVConnectorBase_V1` to allow connectors to report failed block IDs.
2. Executor Reporting
   - `GPUModelRunner` retrieves the failed block IDs and includes them in `ModelRunnerOutput` (should be aggregated across all workers in TP setups).
3. Scheduler Recovery Logic
   - Detects affected requests that used failed blocks
   - Truncates the number of computed blocks for affected requests to their longest valid prefix
   - Reschedules affected requests for recomputation
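The reporting and detection steps can be sketched as follows. This is an illustrative function (name and data shapes are invented, and the real TP aggregation would run in the executor, not as a plain loop): failed block IDs are unioned across tensor-parallel workers, then matched against each request's block list to find the affected requests.

```python
def find_affected_requests(per_worker_failed, request_blocks):
    """Hypothetical sketch of failed-block aggregation and detection.

    per_worker_failed: one iterable of failed block IDs per TP worker.
    request_blocks: request_id -> ordered list of KV block IDs it used.
    Returns the set of request IDs that must be rescheduled."""
    failed = set()
    for worker_ids in per_worker_failed:
        failed |= set(worker_ids)  # a failure on ANY rank counts
    return {req_id for req_id, blocks in request_blocks.items()
            if not failed.isdisjoint(blocks)}
```

Taking the union across workers matters for the deadlock concern above: every rank must agree on which requests are rescheduled, otherwise collective ops like all-reduce can desynchronize.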
Feedback Period.
No response
CC List.
No response
Any Other Things.
🔮 Future Enhancements
- Skip attention computations that depend on failed KV blocks (early detection)
- Make recovery behavior configurable (e.g., allow request failure instead of rescheduling for use cases like disaggregated P-D)
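The configurable-recovery idea could take the shape of a small policy knob. Everything below is hypothetical (the enum, values, and dispatch function are not part of vLLM), purely to illustrate the two behaviors mentioned:

```python
from enum import Enum

class KVLoadFailurePolicy(Enum):
    """Hypothetical config knob for the recovery behavior sketched above."""
    RECOMPUTE = "recompute"   # reschedule from the longest valid prefix
    FAIL_REQUEST = "fail"     # abort the request (e.g., disaggregated P-D)

def on_load_failure(policy, request_id):
    # Illustrative dispatch only; not the actual vLLM implementation.
    if policy is KVLoadFailurePolicy.RECOMPUTE:
        return ("reschedule", request_id)
    return ("fail", request_id)
```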
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.