libexpr: Use 16-byte atomic loads/stores MOVAPS/MOVDQA for ValueStorage<8> #13964
base: master
Conversation
Thanks, I'll apply this to the parallel eval branch and do some benchmarking!
* They are just not guaranteed to be atomic without AVX.
*
* For more details see:
* - [^] Intel® 64 and IA-32 Architectures Software Developer’s Manual (10.1.1 Guaranteed Atomic Operations).
The cited section is in Volume 3A.
* so happens on x86_64 with AVX, MOVDQA/MOVAPS instructions (16-byte aligned
* 128-bit loads and stores) are atomic [^]. Note that
* these instructions are not part of AVX but rather SSE2, which is x86_64-v1.
* They are just not guaranteed to be atomic without AVX.
I don't understand the optimization done by this PR.

AFAIK there is no performance difference between MOVDQA and MOVDQU on modern hardware if the memory is aligned, and there won't be any unaligned accesses here.

I doubt that atomic stores are sufficient in the multi-threaded evaluator. Acquire/release semantics are required. I don't think there is any way around something like LOCK CMPXCHG16B (at least for writes), while MOVDQA could be used for reads.

Is it guaranteed that the compiler cannot reorder memory accesses across intrinsics?
> Acquire/release semantics are required.

Indeed, libatomic uses an mfence for stores. I was waiting for feedback from @edolstra on what would be required for the parallel eval branch.

> CMPXCHG16B

Yes, DCAS is also necessary for thunk locking.

Basically, the whole idea of this patch is to show that this hackery (https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L526-L528, https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/value.hh#L471-L472) can be avoided. Instead of splitting the ValueStorage in this dubious way, we'd instead have CMPXCHG16B in place of https://github.com/DeterminateSystems/nix-src/blob/766f43aa6acb1b3578db488c19fbbedf04ed9f24/src/libexpr/include/nix/expr/eval-inline.hh#L106-L113.
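For illustration, a minimal sketch of what such a 16-byte CAS on a value slot could look like; PackedValue and tryUpdateThunk are hypothetical names for this sketch, not the actual ValueStorage<8> layout:

```cpp
#include <cstdint>

// Hypothetical 16-byte value representation; the real ValueStorage<8>
// layout in libexpr differs.
struct alignas(16) PackedValue
{
    uintptr_t discriminator; // type tag / pointer bits
    uintptr_t payload;
};

// Atomically replace a thunk with its forced result. With -mcx16 on
// x86_64 this can lower to LOCK CMPXCHG16B, though depending on the
// compiler version it may still go through libatomic.
bool tryUpdateThunk(PackedValue * slot, PackedValue expected, PackedValue desired)
{
    return __atomic_compare_exchange(
        slot, &expected, &desired,
        /* weak = */ false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
```

The success ordering doubles as the release barrier for the thunk's result, so no separate mfence would be needed on the update path.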
> I don't understand the optimization done by this PR.

This isn't really an optimization, more of laying the groundwork to facilitate parallel eval. It's missing some pieces, like release semantics for stores, which would really just be achieved with __atomic_compare_exchange in the places where thunks get updated. So updatePayload doesn't really need to have release semantics, but the atomic load is crucial and can't be achieved by other means. Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.
Compiler builtins could be used, e.g. std::atomic<unsigned __int128>, but this produces function calls.
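For reference, a minimal example of that portable variant. On GCC/Clang for x86_64 the accesses below compile to out-of-line calls (__atomic_load_16 / __atomic_store_16 in libatomic, linked with -latomic) rather than inline MOVDQA:

```cpp
#include <atomic>

// Portable 16-byte atomic: correct, but every access is a function
// call into libatomic, dispatched through an IFUNC at load time.
std::atomic<unsigned __int128> cell;

unsigned __int128 read()
{
    return cell.load(std::memory_order_acquire);
}

void write(unsigned __int128 v)
{
    cell.store(v, std::memory_order_release);
}
```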
> Standard atomics go through libatomic's IFUNCs, which is really bad for performance, as noted by Eelco.

That's what I mentioned above, yes. The default implementation of 16-byte atomics is useless.
> > Acquire/release semantics are required.
>
> Indeed, libatomic uses an mfence for stores. I was waiting for feedback from @edolstra on what would be required for the parallel eval branch.

Indeed, no LOCK CMPXCHG16B or mfence instruction is needed, because MOVDQA pairs have acquire/release semantics when the feature flag CPUID.01H:ECX.AVX[bit 28] is set and memory_order_seq_cst semantics are not required.

Thus, dynamic dispatch is required because of older CPUs.

Hence your implementation is OK from the hardware point of view, except that the compiler needs to know about this, i.e. that no memory accesses are reordered across the atomic memory accesses and that the atomic memory accesses aren't optimized out.
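A sketch of what that dynamic dispatch might look like with GCC/Clang; __builtin_cpu_supports("avx") tests the CPUID.01H:ECX.AVX bit mentioned above, and the helper names are made up for this example:

```cpp
#include <emmintrin.h> // SSE2 intrinsics: _mm_load_si128 (MOVDQA)

using Load16Fn = __m128i (*)(const __m128i *);

// Fast path: a plain 16-byte aligned load, atomic only on CPUs that
// report AVX (CPUID.01H:ECX.AVX[bit 28]).
static __m128i load16Avx(const __m128i * p)
{
    return _mm_load_si128(p);
}

// Fallback for pre-AVX CPUs: let libatomic pick a safe implementation.
static __m128i load16Fallback(const __m128i * p)
{
    __m128i result;
    __atomic_load(p, &result, __ATOMIC_ACQUIRE);
    return result;
}

// Resolve once; a real implementation might use an IFUNC instead of a
// plain function pointer.
static Load16Fn resolveLoad16()
{
    return __builtin_cpu_supports("avx") ? load16Avx : load16Fallback;
}
```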
> Thus, dynamic dispatch is required because of older CPUs.

Yeah, the idea is to not support parallel eval without AVX. Seems like even a 10-year-old potato like Ivy Bridge has it, so it's not a big deal.

> the compiler needs to know about this

We can plonk down std::atomic_thread_fence(std::memory_order_acq_rel). On x86 that should be purely an optimization barrier, without any instructions being emitted.
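Concretely, something along these lines (a sketch with made-up names). It relies on GCC/Clang treating the fence as a compiler barrier for the surrounding intrinsics, not on the C++ memory model, which doesn't know these accesses are atomic:

```cpp
#include <atomic>
#include <emmintrin.h>

// 16-byte acquire load: MOVDQA is atomic on AVX-capable CPUs; the
// fence emits no instruction on x86_64 but keeps the compiler from
// moving later accesses before the load.
inline __m128i atomicLoad16(const __m128i * p)
{
    __m128i v = _mm_load_si128(p);
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}

// 16-byte release store: fence first, then the MOVDQA store, so
// earlier writes can't be reordered past it.
inline void atomicStore16(__m128i * p, __m128i v)
{
    std::atomic_thread_fence(std::memory_order_release);
    _mm_store_si128(p, v);
}
```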
Thanks @NaN-git btw, your expertise in x86 and memory models is much appreciated.
Does this require AVX specifically? Put another way, would recent fully featured ARM processors (Graviton, Apple's M-series, etc.) be supported?
Force-pushed from a9cab3a to e1d1a89.
libexpr: Use 16-byte atomic loads/stores MOVAPS/MOVDQA for ValueStorage<8>
Motivation
See the extensive comment in the code for the reasons why we have to do it this way. The motivation for the change is to simplify the eventual implementation of parallel evaluation. Being able to atomically update the whole Value is much easier to reason about.
Context