[P/D] [NixlConnector] Draft: improved failure handling + context propagation #26171

wseaton · 2025-10-03T14:30:06Z

Purpose

This integrates nixl_connector with the additional scheduler features exposed in #19330 for retrying requests that have failed blocks. It also handles and propagates NIXL exceptions that previously would crash the decode server even when the failure was transient.

Features:

Allow fine-grained configuration of the behavior, as there are cases (like DBO) where we basically never want to attempt a local prefill on a decode worker.
In the must-fail case, we need to propagate a request abort all the way to the API server so we can throw a semantically meaningful error. The existing finish_reason=abort is not technically spec compliant, so this has been changed to throw a 500, signaling that the request should be retried. These changes could be cherry-picked into another PR.
Allow for auto, a strategy that uses a (currently hardcoded) number of tokens to decide whether to do a local prefill or abort.
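
A minimal sketch of how the three behaviors listed above might be selected. The names `KVLoadFailurePolicy`, `should_prefill_locally`, and `AUTO_MAX_LOCAL_PREFILL_TOKENS` are hypothetical and not taken from the PR. The threshold value is a placeholder, and the direction of the auto rule (small requests recompute locally, large ones abort) is an assumption.

```python
from enum import Enum


class KVLoadFailurePolicy(str, Enum):  # hypothetical name, not from the PR
    RECOMPUTE = "recompute"  # always prefill locally on the decode worker
    FAIL = "fail"            # never prefill locally; abort the request
    AUTO = "auto"            # decide from the number of tokens to recompute


# Placeholder value: the PR hardcodes a threshold, but it is not stated here.
AUTO_MAX_LOCAL_PREFILL_TOKENS = 1024


def should_prefill_locally(
    policy: KVLoadFailurePolicy,
    num_tokens: int,
    threshold: int = AUTO_MAX_LOCAL_PREFILL_TOKENS,
) -> bool:
    """Return True to recompute locally, False to abort the request."""
    if policy is KVLoadFailurePolicy.RECOMPUTE:
        return True
    if policy is KVLoadFailurePolicy.FAIL:
        return False
    # AUTO (assumed rule): small prefills are cheap enough to redo locally.
    return num_tokens <= threshold
```

When this returns False, the abort would be propagated to the API server and surfaced as a 500 so the client can retry, as described in the must-fail case.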

Test Plan

For integration testing, I injected faults using a vLLM process instrumented with https://github.com/wseaton/ucx-fault-injector/, which forces NIXL exceptions to be thrown during transfer.

Test Result

Load failure recovery works for transfer initiation failures. Still working on triggering failures during block reads/notifications.

mergify · 2025-10-03T14:30:43Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wseaton.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request introduces failure recovery mechanisms for the NIXL KV connector. It adds error handling for transfer initiation failures and failures during block reads. When a failure occurs, the affected KV cache blocks are marked as invalid, and this information is propagated to the scheduler for retrying the request. The changes also include adding statistics for failed transfers and notifications, and rate-limiting for some log messages to prevent spam.

The overall approach is sound and significantly improves the robustness of the NIXL connector. However, I've found a critical issue where failed blocks are not reported correctly when use_host_buffer is disabled, which would prevent failure recovery in that configuration. I've left a comment with details on the issue and a suggested fix.
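
The log rate-limiting mentioned in the summary could look roughly like the sketch below. `RateLimitedLogger` is a hypothetical helper built on the standard `logging` module. It is not the mechanism used in the PR and not a vLLM API. The per-key interval and the injectable clock are illustrative choices.

```python
import logging
import time
from typing import Callable


class RateLimitedLogger:  # hypothetical helper, not a vLLM API
    """Emit at most one warning per key every `min_interval_s` seconds."""

    def __init__(
        self,
        logger: logging.Logger,
        min_interval_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._logger = logger
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_emit: dict[str, float] = {}
        self.suppressed = 0  # count of messages dropped by the limiter

    def warning(self, key: str, msg: str, *args: object) -> bool:
        """Log `msg` unless `key` was logged too recently. Returns True if emitted."""
        now = self._clock()
        last = self._last_emit.get(key)
        if last is not None and now - last < self._min_interval_s:
            self.suppressed += 1
            return False
        self._last_emit[key] = now
        self._logger.warning(msg, *args)
        return True
```

Keying by failure type (for example one key for transfer initiation failures and one for failed notifications) keeps a burst of one kind of error from hiding another.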

gemini-code-assist · 2025-10-03T14:32:35Z

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py

+                    meta = self._recving_metadata.get(req_id)
+                    if meta:
+                        self._invalid_block_ids.update(meta.local_block_ids)


There's a potential issue here. self._recving_metadata is only populated when self.use_host_buffer is true (in start_load_kv). If use_host_buffer is false, meta will be None, and _invalid_block_ids will not be updated on transfer failure. This means failed blocks won't be reported for retry, which undermines the goal of this PR.

To fix this, _recving_metadata should be populated for all receiving requests, regardless of use_host_buffer. This would likely involve removing the if self.use_host_buffer: condition in start_load_kv.

After that change, you'll also need to ensure _recving_metadata is cleaned up for successful requests when use_host_buffer is false, probably in get_finished, to prevent a memory leak.
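
The suggested fix can be sketched in isolation as follows. `RecvMeta` and `RecvTracker` are hypothetical stand-ins, not the connector's actual classes, and the `on_transfer_failed` / `on_transfer_done` hooks are assumed names for wherever the connector detects failure and completion. Only `_recving_metadata`, `_invalid_block_ids`, `local_block_ids`, `use_host_buffer`, and `start_load_kv` come from the code under review.

```python
from dataclasses import dataclass, field


@dataclass
class RecvMeta:  # hypothetical stand-in for the per-request metadata
    local_block_ids: list[int]


@dataclass
class RecvTracker:  # hypothetical; not the actual connector worker
    use_host_buffer: bool = False
    _recving_metadata: dict[str, RecvMeta] = field(default_factory=dict)
    _invalid_block_ids: set[int] = field(default_factory=set)

    def start_load_kv(self, req_id: str, meta: RecvMeta) -> None:
        # Track every receiving request, regardless of use_host_buffer.
        self._recving_metadata[req_id] = meta

    def on_transfer_failed(self, req_id: str) -> None:
        # Mark the request's local blocks invalid so they are reported for retry.
        meta = self._recving_metadata.pop(req_id, None)
        if meta:
            self._invalid_block_ids.update(meta.local_block_ids)

    def on_transfer_done(self, req_id: str) -> None:
        # Drop the entry on success so the map does not grow without bound.
        self._recving_metadata.pop(req_id, None)
```

Popping the entry on both paths is one way to avoid the leak noted above. The real connector may need to keep the metadata longer when the host buffer is in use.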

Signed-off-by: Will Eaton <[email protected]>

wseaton requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac, alexm-redhat, heheda12345, ApostaC and NickLucche as code owners October 3, 2025 14:30

mergify bot added the v1 label Oct 3, 2025

mergify bot added needs-rebase kv-connector labels Oct 3, 2025

gemini-code-assist bot reviewed Oct 3, 2025

wseaton requested review from aarnphm and chaunceyjiang as code owners October 3, 2025 19:35

mergify bot added the frontend label Oct 3, 2025

wseaton changed the title ~~[P/D] [NixlConnecotr] Draft: add KV load failure recovery to nixl connector~~ [P/D] [NixlConnector] Draft: failure handling + context propogation Oct 4, 2025

wseaton changed the title ~~[P/D] [NixlConnector] Draft: failure handling + context propogation~~ [P/D] [NixlConnector] Draft: improved failure handling + context propagation Oct 4, 2025

load failure handling + error context propagation

bfd1f52

Signed-off-by: Will Eaton <[email protected]>

wseaton force-pushed the nixl-failure-recovery branch from 035e54b to bfd1f52 Compare October 6, 2025 13:46

mergify bot removed the needs-rebase label Oct 6, 2025

wseaton added 2 commits October 6, 2025 09:58

precommit

6bd1517

Signed-off-by: Will Eaton <[email protected]>

clean up error type handling

1cec065

Signed-off-by: Will Eaton <[email protected]>
