
Conversation

xokdvium (Contributor):

…ge<8>

Motivation

See the extensive comment in the code for the reasons why we have to do it this way. The motivation for the change is to simplify the eventual implementation of parallel evaluation. Being able to atomically update the whole Value is much easier to reason about.


@edolstra (Member):

Thanks, I'll apply this to the parallel eval branch and do some benchmarking!

* They are just not guaranteed to be atomic without AVX.
*
* For more details see:
* - [^] Intel® 64 and IA-32 Architectures Software Developer’s Manual (10.1.1 Guaranteed Atomic Operations).
Contributor:

in Volume 3A

* so happens on x86_64 with AVX, MOVDQA/MOVAPS instructions (16-byte aligned
* 128-bit loads and stores) are atomic [^]. Note that
* these instructions are not part of AVX but rather SSE2, which is x86_64-v1.
* They are just not guaranteed to be atomic without AVX.
Contributor:

I don't understand the optimization done by this PR.
AFAIK there is no performance difference between MOVDQA and MOVDQU on modern hardware if the memory is aligned, and there won't be any unaligned accesses here.

I doubt that atomic stores are sufficient in the multi-threaded evaluator. Acquire/release semantics are required. I don't think there is any way around something like LOCK CMPXCHG16B (maybe only for writes), while MOVDQA could be used for reads.
Is it guaranteed that the compiler cannot reorder memory accesses across intrinsics?
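
(For reference, a minimal sketch of the intrinsic-to-instruction mapping behind the MOVDQA/MOVDQU point above; the standalone functions are purely illustrative.)

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Aligned 16-byte load: compilers emit MOVDQA (VMOVDQA when AVX is enabled).
// Faults if src is not 16-byte aligned.
__m128i load_aligned(const __m128i * src)
{
    return _mm_load_si128(src);
}

// Unaligned 16-byte load: compilers emit MOVDQU (VMOVDQU when AVX is enabled).
// Also valid on aligned pointers, and no slower there on modern CPUs.
__m128i load_unaligned(const void * src)
{
    return _mm_loadu_si128(static_cast<const __m128i *>(src));
}
```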

@xokdvium (Contributor, author) — Sep 12, 2025:

> Acquire/release semantics are required.

Indeed, libatomic uses an mfence for stores. I was waiting for feedback from @edolstra on what would be required for the parallel eval branch.

> CMPXCHG16B

Yes, DCAS is also necessary for thunk locking.

Basically the whole idea of this patch is to show that this hackery (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L526-L528, https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L471-L472) can be avoided. Instead of splitting the ValueStorage in this dubious way, we'd use CMPXCHG16B in place of (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/eval-inline.hh#L106-L113).
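
(For illustration, a rough sketch of the kind of DCAS-based thunk update being described here; PackedValue and tryUpdateThunk are hypothetical names, not the branch's actual code.)

```cpp
#include <cstdint>

// Hypothetical 16-byte, 16-byte-aligned value representation, so that
// 16-byte atomic builtins / CMPXCHG16B can operate on it directly.
struct alignas(16) PackedValue
{
    std::uintptr_t w0, w1;
};

// Atomically replace a thunk with its forced result, but only if nobody else
// got there first. Whether this lowers to an inline LOCK CMPXCHG16B or to a
// libatomic call depends on the compiler and flags (e.g. clang with -mcx16
// inlines it).
inline bool tryUpdateThunk(PackedValue * v, PackedValue expectedThunk, PackedValue forced)
{
    return __atomic_compare_exchange(
        v, &expectedThunk, &forced,
        /* weak = */ false,
        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
```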

xokdvium (Contributor, author):

> I don't understand the optimization done by this PR.

This isn't really an optimization; it's more about laying the groundwork for parallel eval. It's still missing some pieces, like release semantics for stores, which would really just be achieved with __atomic_compare_exchange in the places where thunks get updated. So updatePayload doesn't really need release semantics, but the atomic load is crucial and can't be achieved by other means. Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.

Contributor:

Compiler builtins could be used, e.g. std::atomic<unsigned __int128>, but this produces function calls.
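
(For example — assuming GCC or Clang, and linking against libatomic where required — the standard spelling compiles, but the 16-byte operations typically end up as out-of-line calls:)

```cpp
#include <atomic>
#include <cstdio>

int main()
{
    std::atomic<unsigned __int128> v{0};

    // On common x86_64 toolchains this is not lock-free: the loads and stores
    // below become calls into libatomic (e.g. __atomic_load_16) rather than
    // inline instructions.
    std::printf("lock-free: %d\n", static_cast<int>(v.is_lock_free()));

    unsigned __int128 x = v.load(std::memory_order_acquire);
    v.store(x + 1, std::memory_order_release);
}
```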

xokdvium (Contributor, author):

> Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.

That's what I mention above, yes. The default implementation of 16-byte atomics is useless.

@NaN-git (Contributor) — Sep 12, 2025:

> > Acquire/release semantics are required.
>
> Indeed, libatomic uses an mfence for stores. I was waiting for feedback from @edolstra on what would be required for the parallel eval branch.

Actually, no LOCK CMPXCHG16B or mfence instruction is needed, because MOVDQA pairs have acquire/release semantics when the feature flag CPUID.01H:ECX.AVX[bit 28] is set, as long as memory_order_seq_cst semantics are not required.
Thus dynamic dispatch is required because of older CPUs.

Hence your implementation is OK from the hardware point of view, except that the compiler needs to know about this, i.e. that no memory accesses are reordered across the atomic memory accesses and that the atomic memory accesses aren't optimized out.

xokdvium (Contributor, author):

> Thus dynamic dispatch is required because of older CPUs.

Yeah, the idea is to not support parallel eval without AVX. Seems like even a 10-year-old potato like Ivy Bridge has it, so it's not a big deal.

> the compiler needs to know about this

We can plonk down a std::atomic_thread_fence(std::memory_order_acq_rel). On x86 that should be purely an optimization barrier, without any instructions being emitted.
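
(For illustration, a minimal sketch of that combination; the helper names are made up, and the atomicity relied on is the AVX-era hardware guarantee discussed above rather than anything the C++ memory model formally promises for these intrinsics.)

```cpp
#include <atomic>
#include <emmintrin.h> // SSE2: _mm_load_si128 / _mm_store_si128

// Illustrative helpers, not the PR's code. The fences compile to no
// instructions on x86, but they stop the compiler from reordering other
// memory accesses across the 16-byte load/store.
inline __m128i atomic_load_16(const __m128i * src)
{
    __m128i v = _mm_load_si128(src); // MOVDQA: a single 16-byte aligned load
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}

inline void atomic_store_16(__m128i * dst, __m128i v)
{
    std::atomic_thread_fence(std::memory_order_release);
    _mm_store_si128(dst, v); // MOVDQA: a single 16-byte aligned store
}
```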

xokdvium (Contributor, author):

Thanks @NaN-git btw, your expertise in x86 and memory models is much appreciated.

Contributor:

AVX specifically? Put another way, would recent feature-full versions of ARM processors (Graviton, Apple's M-series, etc.) be supported?

xokdvium marked this pull request as draft on September 12, 2025 at 18:30
xokdvium force-pushed the value-double-width-atomics branch from a9cab3a to e1d1a89 on September 12, 2025 at 18:58