Skip to content

Conversation

kg
Copy link
Member

@kg kg commented Mar 4, 2025

Issue #109410 appears to be a case where klass is 0 when we perform an isinst operation, but the cache and obj are nonzero and look like valid addresses. klass is either a compile-time (well, jit-time) constant or being fetched out of the cache (it looks like it can be either depending on some sort of rgctx condition).

This PR adds null checks in two places along with a memory barrier in the location where we believe an uninitialized cache is being published to other threads.

@ghost ghost added the area-VM-meta-mono label Mar 4, 2025
@kg kg marked this pull request as ready for review March 5, 2025 00:19
@Copilot Copilot AI review requested due to automatic review settings March 5, 2025 00:19
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

@kg
Copy link
Member Author

kg commented Mar 5, 2025

@BrzVlad noticed that we might need a memory barrier to prevent other threads from observing a NULL klass in the cache, which is compatible with the failure on CI and the failures reported in the field. Both CI and the reports involve multithreading, so it's possible there's a race here.

We haven't seen this on x64, which makes sense because x86 and x64 have a stronger memory model, but I need to figure out whether this particular failure is possible with ARM's weaker memory model or not.

Remove emitted nullcheck in jitcode (too expensive)
@kg kg changed the title [mono] Speculative assertions for issue 109410 [mono] Fix for issue 109410 Mar 5, 2025
@steveisok steveisok requested review from lateralusX and BrzVlad March 5, 2025 16:46
{
error_init (error);

g_assert (klass);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert is not particularly helpful, we will crash immediately at the line below anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on wasm we wouldn't, but i don't know if this code runs on wasm

// we need a barrier before publishing data via mrgctx->infos [i] because the contents of data may not
// have been published to all cores and another thread may read zeroes or partially initialized data
// out of it, even though we have a barrier before publication of entries in mrgctx->entries below
mono_memory_barrier();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could an option be to use a local array in this loop and then do a memcpy into mrgctx->infos after the memory barrier we already fire below (where we copy entries)? That way we will only fire one memory barrier instead of info->num_entries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like a breakeven performance optimization due to the cost of making the local array and copying at the end. Do you want me to do it before merging? I don't mind if you believe it's important.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will do a full memory barrier so it will have wider impact cross cpu caches. using local stack memory and memcpy won't prevent other cores from running at full speed, but since this is just on the init path and then never again, it might not be worth optimizing. You can keep as is.

@lateralusX lateralusX self-requested a review March 19, 2025 17:26
@kg kg merged commit 8cab0b2 into dotnet:main Mar 19, 2025
65 of 68 checks passed
@srxqds
Copy link
Contributor

srxqds commented Mar 20, 2025

could you backport to release/9.0 ?

@kg
Copy link
Member Author

kg commented Mar 20, 2025

/backport to release/9.0-staging

Copy link
Contributor

Started backporting to release/9.0-staging: https://github.com/dotnet/runtime/actions/runs/13976774768

steveisok pushed a commit that referenced this pull request Mar 31, 2025
… multi-threaded scenarios (#113740)

Backport of #113140

Issue #109410 appears to be a case where klass is 0 when we perform an isinst operation, but the cache and obj are nonzero and look like valid addresses. klass is either a compile-time (well, jit-time) constant or being fetched out of the cache (it looks like it can be either depending on some sort of rgctx condition).

This PR adds null checks in two places along with a memory barrier in the location where we believe an uninitialized cache is being published to other threads.

---------

Co-authored-by: Katelyn Gadd <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Apr 20, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants