
Conversation

xokdvium (Contributor):

…ge<8>

Motivation

See the extensive comment in the code for the reasons why we have to do it this way. The motivation for the change is to simplify the eventual implementation of parallel evaluation. Being able to atomically update the whole Value is much easier to reason about.


@edolstra (Member):

Thanks, I'll apply this to the parallel eval branch and do some benchmarking!

* They are just not guaranteed to be atomic without AVX.
*
* For more details see:
* - [^] Intel® 64 and IA-32 Architectures Software Developer’s Manual (10.1.1 Guaranteed Atomic Operations).
Contributor:

in Volume 3A

* so happens on x86_64 with AVX, MOVDQA/MOVAPS instructions (16-byte aligned
* 128-bit loads and stores) are atomic [^]. Note that
* these instructions are not part of AVX but rather SSE2, which is x86_64-v1.
* They are just not guaranteed to be atomic without AVX.
Contributor:

I don't understand the optimization done by this PR.
AFAIK there is no performance difference between MOVDQA and MOVDQU on modern hardware if the memory is aligned, and there won't be any unaligned accesses here.

I doubt that atomic stores are sufficient in the multi-threaded evaluator. Acquire/release semantics are required. I don't think there is any way around something like LOCK CMPXCHG16B (maybe only for writes), while MOVDQA could be used for reads.
Is it guaranteed that the compiler cannot reorder memory accesses across intrinsics?
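
(For reference, a minimal sketch of the intrinsic-to-instruction mapping behind the MOVDQA/MOVDQU point above; the standalone functions are purely illustrative.)

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Aligned 16-byte load: compilers emit MOVDQA (VMOVDQA when AVX is enabled).
// Faults if src is not 16-byte aligned.
__m128i load_aligned(const __m128i * src)
{
    return _mm_load_si128(src);
}

// Unaligned 16-byte load: compilers emit MOVDQU (VMOVDQU when AVX is enabled).
// Also valid on aligned pointers, and no slower there on modern CPUs.
__m128i load_unaligned(const void * src)
{
    return _mm_loadu_si128(static_cast<const __m128i *>(src));
}
```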

@xokdvium (Contributor, author) — Sep 12, 2025:

> Acquire/release semantics are required.

Indeed, libatomic uses an mfence for stores. I was waiting for feedback from @edolstra on what would be required for the parallel eval branch.

> CMPXCHG16B

Yes, DCAS is also necessary for thunk locking.

Basically the whole idea of this patch is to show that this hackery (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L526-L528, https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L471-L472) can be avoided. Instead of splitting the ValueStorage in this dubious way, we'd use CMPXCHG16B in place of (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/eval-inline.hh#L106-L113).
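
(For illustration, a rough sketch of the kind of DCAS-based thunk update being described here; PackedValue and tryUpdateThunk are hypothetical names, not the branch's actual code.)

```cpp
#include <cstdint>

// Hypothetical 16-byte, 16-byte-aligned value representation, so that
// 16-byte atomic builtins / CMPXCHG16B can operate on it directly.
struct alignas(16) PackedValue
{
    std::uintptr_t w0, w1;
};

// Atomically replace a thunk with its forced result, but only if nobody else
// got there first. Whether this lowers to an inline LOCK CMPXCHG16B or to a
// libatomic call depends on the compiler and flags (e.g. clang with -mcx16
// inlines it).
inline bool tryUpdateThunk(PackedValue * v, PackedValue expectedThunk, PackedValue forced)
{
    return __atomic_compare_exchange(
        v, &expectedThunk, &forced,
        /* weak = */ false,
        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
```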

xokdvium (Contributor, author):

> I don't understand the optimization done by this PR.

This isn't really an optimization; it's more about laying the groundwork for parallel eval. It's still missing some pieces, like release semantics for stores, which would really just be achieved with __atomic_compare_exchange in the places where thunks get updated. So updatePayload doesn't really need release semantics, but the atomic load is crucial and can't be achieved by other means. Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.

Contributor:

Compiler builtins could be used, e.g. std::atomic<unsigned __int128>, but this produces function calls.
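
(For example — assuming GCC or Clang, and linking against libatomic where required — the standard spelling compiles, but the 16-byte operations typically end up as out-of-line calls:)

```cpp
#include <atomic>
#include <cstdio>

int main()
{
    std::atomic<unsigned __int128> v{0};

    // On common x86_64 toolchains this is not lock-free: the loads and stores
    // below become calls into libatomic (e.g. __atomic_load_16) rather than
    // inline instructions.
    std::printf("lock-free: %d\n", static_cast<int>(v.is_lock_free()));

    unsigned __int128 x = v.load(std::memory_order_acquire);
    v.store(x + 1, std::memory_order_release);
}
```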

xokdvium (Contributor, author):

> Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.

That's what I mention above, yes. The default implementation of 16-byte atomics is useless.

@NaN-git (Contributor) — Sep 12, 2025:

> > Acquire/release semantics are required.
>
> Indeed, libatomic uses an mfence for stores. I was waiting for feedback from @edolstra on what would be required for the parallel eval branch.

Actually, no LOCK CMPXCHG16B or mfence instruction is needed, because MOVDQA pairs have acquire/release semantics when the feature flag CPUID.01H:ECX.AVX[bit 28] is set, as long as memory_order_seq_cst semantics are not required.
Thus dynamic dispatch is required because of older CPUs.

Hence your implementation is OK from the hardware point of view, except that the compiler needs to know about this, i.e. that no memory accesses are reordered across the atomic memory accesses and that the atomic memory accesses aren't optimized out.

xokdvium (Contributor, author):

> Thus dynamic dispatch is required because of older CPUs.

Yeah, the idea is to not support parallel eval without AVX. Seems like even a 10-year-old potato like Ivy Bridge has it, so it's not a big deal.

> the compiler needs to know about this

We can plonk down a std::atomic_thread_fence(std::memory_order_acq_rel). On x86 that should be purely an optimization barrier, without any instructions being emitted.
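
(For illustration, a minimal sketch of that combination; the helper names are made up, and the atomicity relied on is the AVX-era hardware guarantee discussed above rather than anything the C++ memory model formally promises for these intrinsics.)

```cpp
#include <atomic>
#include <emmintrin.h> // SSE2: _mm_load_si128 / _mm_store_si128

// Illustrative helpers, not the PR's code. The fences compile to no
// instructions on x86, but they stop the compiler from reordering other
// memory accesses across the 16-byte load/store.
inline __m128i atomic_load_16(const __m128i * src)
{
    __m128i v = _mm_load_si128(src); // MOVDQA: a single 16-byte aligned load
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}

inline void atomic_store_16(__m128i * dst, __m128i v)
{
    std::atomic_thread_fence(std::memory_order_release);
    _mm_store_si128(dst, v); // MOVDQA: a single 16-byte aligned store
}
```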

xokdvium (Contributor, author):

Thanks @NaN-git btw, your expertise in x86 and memory models is much appreciated.

Contributor:

AVX specifically? Put another way, would recent feature-full versions of ARM processors (Graviton, Apple's M-series, etc.) be supported?

xokdvium marked this pull request as draft on September 12, 2025 at 18:30
xokdvium force-pushed the value-double-width-atomics branch from a9cab3a to e1d1a89 on September 12, 2025 at 18:58