-
-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[P/D] [NixlConnector] Draft: improved failure handling + context propagation #26171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces failure recovery mechanisms for the NIXL KV connector. It adds error handling for transfer initiation failures and failures during block reads. When a failure occurs, the affected KV cache blocks are marked as invalid, and this information is propagated to the scheduler for retrying the request. The changes also include adding statistics for failed transfers and notifications, and rate-limiting for some log messages to prevent spam.
The overall approach is sound and significantly improves the robustness of the NIXL connector. However, I've found a critical issue where failed blocks are not reported correctly when use_host_buffer
is disabled, which would prevent failure recovery in that configuration. I've left a comment with details on the issue and a suggested fix.
meta = self._recving_metadata.get(req_id) | ||
if meta: | ||
self._invalid_block_ids.update(meta.local_block_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential issue here. self._recving_metadata
is only populated when self.use_host_buffer
is true (in start_load_kv
). If use_host_buffer
is false, meta
will be None
, and _invalid_block_ids
will not be updated on transfer failure. This means failed blocks won't be reported for retry, which undermines the goal of this PR.
To fix this, _recving_metadata
should be populated for all receiving requests, regardless of use_host_buffer
. This would likely involve removing the if self.use_host_buffer:
condition in start_load_kv
.
After that change, you'll also need to ensure _recving_metadata
is cleaned up for successful requests when use_host_buffer
is false, probably in get_finished
, to prevent a memory leak.
Signed-off-by: Will Eaton <[email protected]>
035e54b
to
bfd1f52
Compare
Signed-off-by: Will Eaton <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
Purpose
This integrates
nixl_connector
with additional scheduler features exposed in #19330 for retrying requests that have failed blocks, also handles and propogates nixl exceptions that previously would result in crash of the decode server even if transient.Features:
finish_reason=abort
is not technically spec compliant, so this has been changed to throw a 500, signaling the request should be retried. These changes could be cherrypicked into another PR.auto
, which is a strategy that uses a (currently hardcoded) num tokens to decide whether or not to do local prefill or abortTest Plan
For integration testing, tested injecting faults using a vllm process instrumented with https://github.com/wseaton/ucx-fault-injector/, which forces nixl exceptions to be thrown during transfer.
Test Result
Working load failure recovery for transfer initiation failures. Working on triggering failures during block read/notifications.