Skip to content

Conversation

saarthak-aws
Copy link

@saarthak-aws saarthak-aws commented Oct 7, 2025

Re-introducing a static unique_ptr to manage the lifecycle of PjRtComputationClient, ensuring that the destructor / destroy method of PJRT Client is called.

After building in this change, ans running the reproduction steps mentioned in #9669, I have manually confirmed that that PJRT Client destructor is called

ubuntu@ip-[redacted]:~$ export TF_CPP_MIN_LOG_LEVEL=0; export TF_CPP_VMODULE="cpu_client=1"; export NEURON_RT_LOG_LEVEL=DEBUG; export PJRT_DEVICE=CPU
(aws_neuronx_venv_pytorch_2_8) ubuntu@ip-172-31-59-9:~$ python -c "import torch_xla; device=torch_xla.device()"
WARNING:root:MASTER_ADDR environment variable is not set, defaulting to localhost
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1759797507.812598  222797 cpu_client.cc:311] PjRtCpuClient created.
I0000 00:00:1759797508.238195  222797 cpu_client.cc:314] PjRtCpuClient destroyed.
ubuntu@ip-[redacted]:~$ pip list | grep torch
torch                     2.9.0a0+git21fec65
torch-xla                 2.9.0+git11590c1

// reference.
static const auto& maybe_client =
*new absl::StatusOr<ComputationClient*>(InitializeComputationClient());
static absl::StatusOr<std::unique_ptr<ComputationClient>> init_result =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This violates Google's C++ style guide: https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables

For singleton objects, we deliberately do not want their destructors to be called, as that can lead to race condition at program exit time.

I'm not sure what this PR is trying to achieve. Could you clarify why you want to sure that the PjRt client dtor is called? Usually we don't destroy the singleton objects - we just let the OS reclaim the resources when the process terminates.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhanyong-wan thanks for the feedback. Could you give an example of the race condition you mentioned and why it was not addressed until v2.8?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style guide I mentioned noted: "When destructors are trivial, their execution is not subject to ordering at all (they are effectively not "run"); otherwise we are exposed to the risk of accessing objects after the end of their lifetime. Therefore, we only allow objects with static storage duration if they are trivially destructible. Fundamental types (like pointers and int) are trivially destructible, as are arrays of trivially destructible types."

For example, at program exit time there could be long-running threads accessing global variables. If a global variable is destructed, such access is undefined behavior.

As to why it wasn't addressed until v2.8, I don't know the history, but my guess is that we just noticed the potential race and decided to fix it.

Copy link

@rajkthakur rajkthakur Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PR #9384, we introduced StatusOr<T> for error handling, which can be trivially destructible when T is trivially destructible. However, looking at PjrtComputationClient's implementation with its explicit destructor and member variables, it appears to not be trivially destructible. Could you shed some light on why we think PjrtComputationClient could be trivially destructible?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajkthakur , StatusOr<T> is not trivially destructible, regardless of whether T is trivially destructible. PjrtComputationClient is not trivially destructible and not meant to be. I don't understand what you mean by "we think PjrtComputationClient could be trivially destructible".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically our design philosophy has been that singletons shouldn't be destroyed and this is not a memory leak - all resources held by these objects will be reclaimed by the OS when the program exits, so there's no need for us to do it explicitly. Can you clarify why this is a problem for neuron? If neuron really needs the resources held by the client returned at program exit time for a good reason, we should implement a Shutdown method that cleans up the client and leaves the client object accessible, so that any long-running threads access the client won't run into undefined behavior.

Copy link

@rajkthakur rajkthakur Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify why this is a problem for neuron?

The Neuron backend's resource cleanup is tied to Pjrt_Client_Destroy calls. This works in JAX and Torch/XLA through v2.7, but this refactor removed explicit destruction calls, breaking neuron's cleanup process. We have observed that relying on OS cleanup is causing unexpected hangs in some cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants