[RFC]: Graceful Error Handling for KV Connector Load Failures #19329

@sdavidbd

Motivation.

🧩 Summary

This RFC proposes adding fault recovery support to the KV connector infrastructure introduced in #15960. The goal is to make vLLM resilient to KV block load failures (e.g., due to eviction or network disconnections) by detecting failures and rescheduling only affected requests for recomputation.

🧠 Background

vLLM recently introduced the KV connector abstraction, providing a pluggable interface for integrating external solutions for KV cache offload and transfer. This enables supporting use cases like KV cache sharing and prefill-decode (P-D) disaggregation.
Notable examples include the LMCache connector (PR #16625) and NIXL connector (PR #17751).

The KV loading protocol involves:

  • start_load_kv() → invoked before the forward pass to initiate (asynchronous) loading of all required KV blocks.
  • wait_for_layer_load() → invoked before each layer’s attention kernel to ensure the relevant blocks are available.

⚠️ Problem

Currently, vLLM has no mechanism to detect or recover from KV load failures. If one or more KV blocks fail to load, the result can be system instability (a crash or hang) or incorrect outputs (silent corruption of the KV cache). This undermines reliability, especially in large-scale or disaggregated deployments.


Proposed Change.

Even if KV load failures occur, the forward pass should complete. Post-execution, vLLM will:

  • Detect failed KV blocks, and
  • Reschedule only the affected requests for recomputation, using the longest valid prefix.
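To make "longest valid prefix" concrete, here is a minimal sketch. It assumes a request's KV blocks are held as an ordered list of block IDs and that the failed IDs are available as a set; the function name `longest_valid_prefix` is hypothetical.

```python
def longest_valid_prefix(block_ids: list[int], failed_block_ids: set[int]) -> int:
    """Return the number of leading blocks that loaded successfully.

    Blocks after the first failed one are treated as invalid as well,
    since only a contiguous prefix is kept.
    """
    for index, block_id in enumerate(block_ids):
        if block_id in failed_block_ids:
            return index
    return len(block_ids)


# Block 12 failed, so only the first two blocks (10 and 11) are kept.
print(longest_valid_prefix([10, 11, 12, 13], {12}))  # 2
```

A request with no failed blocks keeps all of its blocks, so unaffected requests in the same batch lose no progress.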

💡 Key Considerations:

  • Preserve progress for valid blocks in affected requests and for unaffected requests in the same batch.
  • Prevent deadlocks in tensor-parallel setups (e.g., all-reduce ops).
  • In some connector implementations, attention kernels may already have been launched by the time a failure is detected and cannot be "rolled back."

🛠️ Suggested Implementation

  1. KV Connector API Extension

    • Add get_block_ids_with_load_errors() to KVConnectorBase_V1 to allow connectors to report failed block IDs.

  2. Executor Reporting

    • GPUModelRunner retrieves the failed block IDs and includes them in ModelRunnerOutput (should be aggregated across all workers in TP setups).

  3. Scheduler Recovery Logic

    • Detects affected requests that used failed blocks
    • Truncates the number of computed blocks for affected requests to their longest valid prefix
    • Reschedules affected requests for recomputation
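The three steps above could fit together roughly as follows. This is a sketch under stated assumptions, not the proposed implementation: `WorkerOutput` stands in for ModelRunnerOutput, `Request` for the scheduler's per-request state, and `aggregate_failed_blocks` and `recover` are hypothetical helpers. It also assumes a block counts as failed if any TP worker reports it.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerOutput:
    """Hypothetical per-worker result carrying the failed block IDs."""
    failed_block_ids: set[int] = field(default_factory=set)


@dataclass
class Request:
    """Hypothetical scheduler-side state for one request."""
    request_id: str
    block_ids: list[int]
    num_computed_blocks: int


def aggregate_failed_blocks(outputs: list[WorkerOutput]) -> set[int]:
    # Assumption: a block is unusable if any TP worker failed to load it.
    failed: set[int] = set()
    for output in outputs:
        failed |= output.failed_block_ids
    return failed


def recover(requests: list[Request], failed: set[int]) -> list[Request]:
    """Truncate affected requests in place and return those to reschedule."""
    to_reschedule: list[Request] = []
    for request in requests:
        computed = request.block_ids[: request.num_computed_blocks]
        first_bad = next(
            (i for i, block_id in enumerate(computed) if block_id in failed), None
        )
        if first_bad is None:
            continue  # Unaffected request: keep all of its progress.
        # Keep only the longest valid prefix of computed blocks.
        request.num_computed_blocks = first_bad
        to_reschedule.append(request)
    return to_reschedule


outputs = [WorkerOutput({7}), WorkerOutput(), WorkerOutput({9})]
requests = [
    Request("a", [1, 2, 7, 3], num_computed_blocks=4),
    Request("b", [4, 5], num_computed_blocks=2),
]
affected = recover(requests, aggregate_failed_blocks(outputs))
print([r.request_id for r in affected], requests[0].num_computed_blocks)
```

Aggregating before the scheduler acts means every worker sees the same recovery decision, which is relevant to the tensor-parallel deadlock concern listed under Key Considerations.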


Feedback Period.

No response

CC List.

No response

Any Other Things.

🔮 Future Enhancements

  • Skip attention computations that depend on failed KV blocks (early detection)
  • Make recovery behavior configurable (e.g., allow request failure instead of rescheduling for use cases like disaggregated P-D)
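One possible shape for the configurable recovery behavior is a small policy switch. Everything here is hypothetical, including the `KVLoadFailurePolicy` name, its values, and the `handle_load_failure` helper; no such option is implied to exist in vLLM today.

```python
from enum import Enum


class KVLoadFailurePolicy(Enum):
    """Hypothetical policy for requests hit by KV load failures."""
    RECOMPUTE = "recompute"  # Reschedule from the longest valid prefix.
    FAIL = "fail"            # Fail the request, e.g. for disaggregated P-D.


def handle_load_failure(policy: KVLoadFailurePolicy, request_id: str) -> str:
    # Return a description of the action instead of performing it,
    # so the sketch stays independent of any scheduler internals.
    if policy is KVLoadFailurePolicy.FAIL:
        return f"fail request {request_id}"
    return f"reschedule request {request_id}"


print(handle_load_failure(KVLoadFailurePolicy.RECOMPUTE, "req-1"))
print(handle_load_failure(KVLoadFailurePolicy.FAIL, "req-2"))
```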
