Contributor

@wseaton wseaton commented Oct 3, 2025

Purpose

This integrates nixl_connector with the additional scheduler features exposed in #19330 for retrying requests that have failed blocks. It also handles and propagates NIXL exceptions that previously crashed the decode server, even when the failure was transient.

Features:

  1. Allow fine-grained configuration of the failure behavior, as there are cases (like DBO) where we basically never want to attempt to prefill locally on a decode worker.
  2. In the must-fail case, we need to propagate a request abort all the way to the API server so we can throw a semantically meaningful error. The existing finish_reason=abort is not technically spec compliant, so this has been changed to throw a 500, signaling that the request should be retried. These changes could be cherry-picked into another PR.
  3. Allow for auto, a strategy that uses a (currently hardcoded) token count to decide whether to prefill locally or abort.
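
As a rough illustration of the three behaviors listed above, the decision could be sketched as follows. This is not the code in this PR: the names `KVLoadFailurePolicy`, `FailureConfig`, `decide`, and `http_status_for` are hypothetical, the threshold value is made up, and the direction of the auto rule (small recompute cost means local prefill, large means abort) is an assumption, since the description only says a hardcoded token count is used.

```python
from dataclasses import dataclass
from enum import Enum


class KVLoadFailurePolicy(Enum):
    """Hypothetical policy names; the PR's actual option names may differ."""
    RECOMPUTE = "recompute"  # always fall back to local prefill
    FAIL = "fail"            # never prefill locally (e.g. DBO deployments)
    AUTO = "auto"            # decide per request from a token count


class Action(Enum):
    LOCAL_PREFILL = "local_prefill"
    ABORT = "abort"


@dataclass(frozen=True)
class FailureConfig:
    policy: KVLoadFailurePolicy = KVLoadFailurePolicy.AUTO
    # Assumed value: the PR says the threshold is hardcoded but does not state it.
    auto_max_tokens: int = 1024


def decide(config: FailureConfig, num_tokens_to_recompute: int) -> Action:
    """Pick what to do with a request whose KV blocks failed to load."""
    if config.policy is KVLoadFailurePolicy.RECOMPUTE:
        return Action.LOCAL_PREFILL
    if config.policy is KVLoadFailurePolicy.FAIL:
        return Action.ABORT
    # AUTO (assumed rule): recompute only when the work is small enough.
    if num_tokens_to_recompute <= config.auto_max_tokens:
        return Action.LOCAL_PREFILL
    return Action.ABORT


def http_status_for(action: Action) -> int:
    """An aborted request surfaces as a 500 so the client knows to retry."""
    return 500 if action is Action.ABORT else 200
```

With these assumptions, `decide(FailureConfig(), 4096)` returns `Action.ABORT`, which `http_status_for` maps to 500, while a DBO-style deployment would set `policy=KVLoadFailurePolicy.FAIL` so the decode worker never prefills locally.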

Test Plan

For integration testing, I injected faults using a vLLM process instrumented with https://github.com/wseaton/ucx-fault-injector/, which forces NIXL exceptions to be thrown during transfer.

Test Result

Load failure recovery works for transfer initiation failures. Still working on triggering failures during block reads and notifications.

mergify bot commented Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wseaton.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces failure recovery mechanisms for the NIXL KV connector. It adds error handling for transfer initiation failures and failures during block reads. When a failure occurs, the affected KV cache blocks are marked as invalid, and this information is propagated to the scheduler for retrying the request. The changes also include adding statistics for failed transfers and notifications, and rate-limiting for some log messages to prevent spam.

The overall approach is sound and significantly improves the robustness of the NIXL connector. However, I've found a critical issue where failed blocks are not reported correctly when use_host_buffer is disabled, which would prevent failure recovery in that configuration. I've left a comment with details on the issue and a suggested fix.

Comment on lines 1211 to 1213
meta = self._recving_metadata.get(req_id)
if meta:
    self._invalid_block_ids.update(meta.local_block_ids)
Contributor

critical

There's a potential issue here. self._recving_metadata is only populated when self.use_host_buffer is true (in start_load_kv). If use_host_buffer is false, meta will be None, and _invalid_block_ids will not be updated on transfer failure. This means failed blocks won't be reported for retry, which undermines the goal of this PR.

To fix this, _recving_metadata should be populated for all receiving requests, regardless of use_host_buffer. This would likely involve removing the if self.use_host_buffer: condition in start_load_kv.

After that change, you'll also need to ensure _recving_metadata is cleaned up for successful requests when use_host_buffer is false, probably in get_finished, to prevent a memory leak.
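
The suggested fix could look roughly like the standalone sketch below. It is a toy model rather than the connector itself: `RecvMeta`, `RecvTracker`, and its method names are hypothetical stand-ins, and only `_recving_metadata`, `_invalid_block_ids`, `local_block_ids`, and `use_host_buffer` come from the code under review. The point it illustrates is that metadata is recorded for every receiving request, and removed on both the failure and success paths so nothing leaks.

```python
from dataclasses import dataclass, field


@dataclass
class RecvMeta:
    """Hypothetical stand-in for the per-request receive metadata."""
    local_block_ids: list[int]


@dataclass
class RecvTracker:
    """Toy model of the worker-side bookkeeping, not the real connector."""
    use_host_buffer: bool = False
    _recving_metadata: dict[str, RecvMeta] = field(default_factory=dict)
    _invalid_block_ids: set[int] = field(default_factory=set)

    def start_load(self, req_id: str, local_block_ids: list[int]) -> None:
        # Record metadata for every request, not only when use_host_buffer is set.
        self._recving_metadata[req_id] = RecvMeta(list(local_block_ids))

    def on_transfer_failed(self, req_id: str) -> None:
        # pop() both reads the metadata and drops it, so failed requests do not leak.
        meta = self._recving_metadata.pop(req_id, None)
        if meta:
            self._invalid_block_ids.update(meta.local_block_ids)

    def on_finished(self, req_id: str) -> None:
        # Clean up after a successful transfer as well, to avoid a memory leak.
        self._recving_metadata.pop(req_id, None)

    def take_invalid_block_ids(self) -> set[int]:
        # Hand the accumulated failures to the caller and reset the set.
        invalid, self._invalid_block_ids = self._invalid_block_ids, set()
        return invalid
```

In the real connector the cleanup would presumably live in get_finished, as noted above; where exactly it belongs depends on how finished requests are reported there.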

@mergify mergify bot added the frontend label Oct 3, 2025
@wseaton wseaton changed the title [P/D] [NixlConnecotr] Draft: add KV load failure recovery to nixl connector [P/D] [NixlConnector] Draft: failure handling + context propogation Oct 4, 2025
@wseaton wseaton changed the title [P/D] [NixlConnector] Draft: failure handling + context propogation [P/D] [NixlConnector] Draft: improved failure handling + context propagation Oct 4, 2025
@wseaton wseaton force-pushed the nixl-failure-recovery branch from 035e54b to bfd1f52 Compare October 6, 2025 13:46
@mergify mergify bot removed the needs-rebase label Oct 6, 2025
Signed-off-by: Will Eaton <[email protected]>