Motivation.
🧩 Summary
This RFC proposes adding fault recovery support to the KV connector infrastructure introduced in #15960. The goal is to make vLLM resilient to KV block load failures (e.g., due to eviction or network disconnections) by detecting failures and rescheduling only affected requests for recomputation.
🧠 Background
vLLM recently introduced the KV connector abstraction, providing a pluggable interface for integrating external solutions for KV cache offload and transfer. This enables supporting use cases like KV cache sharing and prefill-decode (P-D) disaggregation.
Notable examples include the LMCache connector (PR #16625) and NIXL connector (PR #17751).
The KV loading protocol involves:
- `start_load_kv()` → invoked before the forward pass to initiate (asynchronous) loading of all required KV blocks.
- `wait_for_layer_load()` → invoked before each layer's attention kernel to ensure the relevant blocks are available.
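The two-step protocol above can be sketched as follows. This is a hypothetical illustration of the call pattern (class name, constructor, and the in-memory "store" are invented for the example; the real `KVConnectorBase_V1` signatures may differ):

```python
from concurrent.futures import ThreadPoolExecutor

class KVConnectorSketch:
    """Hypothetical sketch of the KV loading protocol.
    Method names mirror the RFC text; actual vLLM signatures may differ."""

    def __init__(self, store):
        self._store = store  # block_id -> KV payload (stand-in for remote storage)
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._pending = {}

    def start_load_kv(self, block_ids):
        # Before the forward pass: kick off asynchronous loads for all blocks.
        self._pending = {bid: self._pool.submit(self._store.get, bid)
                         for bid in block_ids}

    def wait_for_layer_load(self, layer_block_ids):
        # Before each layer's attention kernel: block until its KV is resident.
        return [self._pending[bid].result() for bid in layer_block_ids]
```

The key property is that loading overlaps with compute: all transfers start before the forward pass, and each layer only synchronizes on the blocks it actually needs.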
⚠️ Problem
Currently, vLLM lacks a mechanism to detect or recover from KV load failures. If one or more KV blocks fail to load, it may lead to system instability (crash or hang) or incorrect outputs (silent corruption of the KV cache). This undermines reliability, especially in large-scale or disaggregated deployments.
Proposed Change.
Even if KV load failures occur, the forward pass should complete. Post-execution, vLLM will:
- Detect failed KV blocks, and
- Reschedule only the affected requests for recomputation, using the longest valid prefix.
💡 Key Considerations:
- Preserve progress for valid blocks in affected requests and for unaffected requests in the same batch.
- Prevent deadlocks in tensor-parallel setups (e.g., all-reduce ops).
- In some connector implementations, attention kernels may have already been launched and cannot be "rolled back."
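The "longest valid prefix" truncation described above can be sketched with a small helper (a hypothetical function, not vLLM's actual code): since later blocks depend on earlier ones, a request's computed-block count is cut back to the run of blocks before the first load failure.

```python
def longest_valid_prefix(block_ids, failed_ids):
    """Hypothetical helper: count the leading blocks that loaded cleanly.

    A request's computed-block count is truncated to this prefix, so
    everything from the first failed block onward is recomputed."""
    failed = set(failed_ids)
    n = 0
    for bid in block_ids:  # blocks in sequence order
        if bid in failed:
            break  # this block and everything after it must be recomputed
        n += 1
    return n
```

For example, a request with blocks `[7, 8, 9, 10]` and a load failure on block 9 keeps its first two blocks and recomputes from block 9 onward.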
🛠️ Suggested Implementation
1. KV Connector API Extension
   - Add `get_block_ids_with_load_errors()` to `KVConnectorBase_V1` to allow connectors to report failed block IDs.
2. Executor Reporting
   - `GPUModelRunner` retrieves the failed block IDs and includes them in `ModelRunnerOutput` (should be aggregated across all workers in TP setups).
3. Scheduler Recovery Logic
   - Detects affected requests that used failed blocks
   - Truncates the number of computed blocks for affected requests to their longest valid prefix
   - Reschedules affected requests for recomputation
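The reporting and detection steps can be sketched as follows. This is an illustrative function (name and data shapes are invented, and the real TP aggregation would run in the executor, not as a plain loop): failed block IDs are unioned across tensor-parallel workers, then matched against each request's block list to find the affected requests.

```python
def find_affected_requests(per_worker_failed, request_blocks):
    """Hypothetical sketch of failed-block aggregation and detection.

    per_worker_failed: one iterable of failed block IDs per TP worker.
    request_blocks: request_id -> ordered list of KV block IDs it used.
    Returns the set of request IDs that must be rescheduled."""
    failed = set()
    for worker_ids in per_worker_failed:
        failed |= set(worker_ids)  # a failure on ANY rank counts
    return {req_id for req_id, blocks in request_blocks.items()
            if not failed.isdisjoint(blocks)}
```

Taking the union across workers matters for the deadlock concern above: every rank must agree on which requests are rescheduled, otherwise collective ops like all-reduce can desynchronize.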
Feedback Period.
No response
CC List.
No response
Any Other Things.
🔮 Future Enhancements
- Skip attention computations that depend on failed KV blocks (early detection)
- Make recovery behavior configurable (e.g., allow request failure instead of rescheduling for use cases like disaggregated P-D)
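The configurable-recovery idea could take the shape of a small policy knob. Everything below is hypothetical (the enum, values, and dispatch function are not part of vLLM), purely to illustrate the two behaviors mentioned:

```python
from enum import Enum

class KVLoadFailurePolicy(Enum):
    """Hypothetical config knob for the recovery behavior sketched above."""
    RECOMPUTE = "recompute"   # reschedule from the longest valid prefix
    FAIL_REQUEST = "fail"     # abort the request (e.g., disaggregated P-D)

def on_load_failure(policy, request_id):
    # Illustrative dispatch only; not the actual vLLM implementation.
    if policy is KVLoadFailurePolicy.RECOMPUTE:
        return ("reschedule", request_id)
    return ("fail", request_id)
```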
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.