18 changes: 14 additions & 4 deletions torch_xla/csrc/runtime/runtime.cpp
@@ -17,7 +17,7 @@ static std::atomic<bool> g_computation_client_initialized(false);
 // Creates a new instance of a `ComputationClient` (e.g.
 // `PjRtComputationClient`), and initializes the computation client.
 // Can only be called when g_computation_client_initialized is false.
-static absl::StatusOr<ComputationClient * absl_nonnull>
+static absl::StatusOr<std::unique_ptr<ComputationClient>>
 InitializeComputationClient() {
   if (sys_util::GetEnvBool("XLA_DUMP_FATAL_STACK", false)) {
     tsl::testing::InstallStacktraceHandler();
@@ -46,7 +46,7 @@ InitializeComputationClient() {
   // Set only if we actually successfully initialized a client.
   g_computation_client_initialized = true;

-  return client.release();
+  return client;
 }

 const absl::StatusOr<ComputationClient * absl_nonnull>& GetComputationClient() {
@@ -55,8 +55,18 @@ const absl::StatusOr<ComputationClient * absl_nonnull>& GetComputationClient() {
   // Since we only allow a single initialization, as soon as this function is
   // called, we store the initialization result in this trivially destructible
   // reference.
-  static const auto& maybe_client =
-      *new absl::StatusOr<ComputationClient*>(InitializeComputationClient());
+  static absl::StatusOr<std::unique_ptr<ComputationClient>> init_result =
Collaborator

This violates Google's C++ style guide: https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables

For singleton objects, we deliberately do not want their destructors to be called, as that can lead to race conditions at program exit time.

I'm not sure what this PR is trying to achieve. Could you clarify why you want to ensure that the PjRt client dtor is called? Usually we don't destroy singleton objects; we just let the OS reclaim the resources when the process terminates.

Collaborator

@zhanyong-wan thanks for the feedback. Could you give an example of the race condition you mentioned and why it was not addressed until v2.8?

Collaborator

The style guide I mentioned noted: "When destructors are trivial, their execution is not subject to ordering at all (they are effectively not "run"); otherwise we are exposed to the risk of accessing objects after the end of their lifetime. Therefore, we only allow objects with static storage duration if they are trivially destructible. Fundamental types (like pointers and int) are trivially destructible, as are arrays of trivially destructible types."

For example, at program exit time there could be long-running threads accessing global variables. If a global variable is destructed, such access is undefined behavior.

As to why it wasn't addressed until v2.8, I don't know the history, but my guess is that we just noticed the potential race and decided to fix it.
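
To make the style-guide rule concrete, here is a minimal sketch of the leaked-singleton pattern it recommends. `FakeClient` and `GetFakeClient` are hypothetical names used only for illustration; they are not part of torch_xla. The static local is a reference to a heap object that is never deleted, so no destructor runs at exit, whereas an owning smart pointer with static storage duration would run one.

```cpp
#include <memory>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a client type with a non-trivial destructor.
struct FakeClient {
  std::string name = "fake";
  ~FakeClient() {}
};

// A raw pointer is trivially destructible, so nothing runs for it at exit.
static_assert(std::is_trivially_destructible_v<FakeClient*>);
// An owning smart pointer is not: as a static, its destructor runs at exit.
static_assert(!std::is_trivially_destructible_v<std::unique_ptr<FakeClient>>);

// Leaked singleton: constructed on first use and intentionally never
// destroyed, so threads still running at exit cannot observe a dead object.
FakeClient& GetFakeClient() {
  static FakeClient& client = *new FakeClient();
  return client;
}
```

The trade-off is that any cleanup performed by the destructor never happens, which is exactly the concern raised later in this thread for backends that rely on it.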

@rajkthakur Oct 7, 2025

In PR #9384, we introduced StatusOr<T> for error handling, which can be trivially destructible when T is trivially destructible. However, looking at PjRtComputationClient's implementation, with its explicit destructor and member variables, it does not appear to be trivially destructible. Could you shed some light on why we think PjRtComputationClient could be trivially destructible?

Collaborator

@rajkthakur, StatusOr<T> is not trivially destructible, regardless of whether T is trivially destructible. PjRtComputationClient is not trivially destructible and is not meant to be. I don't understand what you mean by "we think PjRtComputationClient could be trivially destructible".

@rajkthakur Oct 8, 2025

> Can you clarify why this is a problem for neuron?

The Neuron backend's resource cleanup is tied to PJRT_Client_Destroy calls. This works in JAX and in Torch/XLA through v2.7, but this refactor removed the explicit destruction calls, breaking Neuron's cleanup process. We have observed that relying on OS cleanup causes unexpected hangs in some cases.

Author

@saarthak-aws Oct 9, 2025

I'd like to consider the Shutdown method approach. One implementation I can imagine is to add a Shutdown method to PjRtComputationClient, which would delete the xla::PjRtClient client_ member, such as the following:

void PjRtComputationClient::Shutdown() {
    // Destroys the owned xla::PjRtClient (equivalent to release() + delete).
    client_.reset();
}

We could call the Shutdown method at the end of the PrepareToExit function:

void PrepareToExit() {
  runtime::ComputationClient* client =
      runtime::GetComputationClientIfInitialized();
  if (client != nullptr) {
    auto xla_device = GetDeviceOrCurrent("");
    SetAllReduceToken(xla_device, nullptr);
    WaitDeviceOps();
  }
}

which is registered with atexit in __init__.py:

atexit.register(_prepare_to_exit)

Since Shutdown will be called at the end of the PrepareToExit sequence, we should expect no further accesses to client_.

Thanks for the feedback so far. I would appreciate your thoughts on this approach.

Collaborator

Thanks for clarifying. It seems to be a bug that Neuron sometimes hangs if the cleanup is left to the OS. My suggestion would be to root-cause and fix that bug.

Re: the shutdown approach, I don't think we can count on no further access to client_ after the atexit hook is called. The whole point of Google's policy on global variable destruction is that there can be long-running threads after the exit hook is called. Think about the case where someone starts a computation in a long-running thread and then exits. The thread is never joined and thus may still access client_ after the program exit hook.

Author

While we investigate why leaving cleanup to the OS leaves the Neuron backend in a bad state, do you have any thoughts on what would be the correct approach for implementing the Shutdown method?

We would have to leave client_ accessible after we have destroyed the actual xla::PjRtClient (since destruction ends up calling PJRT_Client_Destroy). One way I can think of doing so is to switch to a stub implementation of client_ at this point, so that long-running threads can still access client_ but get some default behavior. Is that the right approach/pattern?

Collaborator

I think the best course of action is to fix the hang, as implementing Shutdown correctly adds significant complexity to the design.

That said, here's how Shutdown should work if done correctly: it should allow in-flight computations that need the client to finish, and it should make any new computation that requests the client fail to get it. This means we'll likely need to use a shared_ptr to hold the client (so that in-flight computations can extend its lifespan).

As you can see, this is doable but not trivial. Hence my advice to avoid it.
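
For reference, the shared_ptr idea described above could look roughly like the sketch below. Everything here is hypothetical: `FakeClient`, `ClientHolder`, `Acquire`, and `GetHolder` are illustrative names, not torch_xla APIs, and the sketch ignores how real call sites would handle a null result. In-flight work keeps its own `shared_ptr` copy, so the client outlives `Shutdown()` until the last user releases it, while new callers get `nullptr`.

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the real client type.
struct FakeClient {
  int Compute(int x) const { return x * 2; }
};

// Hypothetical holder that hands out shared ownership of the client.
class ClientHolder {
 public:
  explicit ClientHolder(std::shared_ptr<FakeClient> client)
      : client_(std::move(client)) {}

  // Returns a shared owner, or nullptr once Shutdown() has been called.
  std::shared_ptr<FakeClient> Acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    return client_;
  }

  // Drops the holder's reference. The client is destroyed only when the
  // last in-flight user releases its copy.
  void Shutdown() {
    std::shared_ptr<FakeClient> doomed;
    {
      std::lock_guard<std::mutex> lock(mu_);
      doomed = std::move(client_);  // client_ is left empty.
    }
    // `doomed` goes out of scope here, outside the lock.
  }

 private:
  std::mutex mu_;
  std::shared_ptr<FakeClient> client_;
};

// The holder itself is leaked, so it stays valid for threads that are
// still running at exit; only the client it owns can be shut down.
ClientHolder& GetHolder() {
  static ClientHolder& holder =
      *new ClientHolder(std::make_shared<FakeClient>());
  return holder;
}
```

A caller would take `auto client = GetHolder().Acquire();` once per computation and check it for null before use. The extra locking and the null checks at every call site are part of the complexity that the comment above warns about.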

+      InitializeComputationClient();
+
+  static absl::StatusOr<ComputationClient* absl_nonnull> maybe_client = []() {
+    if (init_result.ok()) {
+      return absl::StatusOr<ComputationClient * absl_nonnull>(
+          init_result.value().get());
+    } else {
+      return absl::StatusOr<ComputationClient * absl_nonnull>(
+          init_result.status());
+    }
+  }();
   return maybe_client;
 }
