-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<algorithm>
: Optimize sample()
and shuffle()
with Lemire's algorithm
#5735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<algorithm>
: Optimize sample()
and shuffle()
with Lemire's algorithm
#5735
Conversation
12th Gen Intel(R) Core(TM) i5-1235U (1.30 GHz) P-cores
E-cores
|
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Benchmark | Before | After | Speedup ------------------------------------------------------|------------|------------|-------- `bm_sample<uint8_t, alg_type::std_fn>/1048576/32768` | 4395155 ns | 3719749 ns | 1.18 `bm_sample<uint16_t, alg_type::std_fn>/1048576/32768` | 4422082 ns | 3740169 ns | 1.18 `bm_sample<uint32_t, alg_type::std_fn>/1048576/32768` | 4429234 ns | 3545233 ns | 1.25 `bm_sample<uint64_t, alg_type::std_fn>/1048576/32768` | 4421113 ns | 3564755 ns | 1.24 `bm_sample<uint8_t, alg_type::rng>/1048576/32768` | 4407065 ns | 3926217 ns | 1.12 `bm_sample<uint16_t, alg_type::rng>/1048576/32768` | 4408346 ns | 3531568 ns | 1.25 `bm_sample<uint32_t, alg_type::rng>/1048576/32768` | 4409194 ns | 3527145 ns | 1.25 `bm_sample<uint64_t, alg_type::rng>/1048576/32768` | 4454799 ns | 3556353 ns | 1.25 ------------------------------------------------------|------------|------------|----- `bm_shuffle<uint8_t, alg_type::std_fn>/1048576` | 5750192 ns | 3733152 ns | 1.54 `bm_shuffle<uint16_t, alg_type::std_fn>/1048576` | 5907846 ns | 4186912 ns | 1.41 `bm_shuffle<uint32_t, alg_type::std_fn>/1048576` | 6043149 ns | 4383343 ns | 1.38 `bm_shuffle<uint64_t, alg_type::std_fn>/1048576` | 6117446 ns | 4582720 ns | 1.33 `bm_shuffle<uint8_t, alg_type::rng>/1048576` | 5701609 ns | 3719451 ns | 1.53 `bm_shuffle<uint16_t, alg_type::rng>/1048576` | 5735311 ns | 4229480 ns | 1.36 `bm_shuffle<uint32_t, alg_type::rng>/1048576` | 5771077 ns | 4391066 ns | 1.31 `bm_shuffle<uint64_t, alg_type::rng>/1048576` | 5794527 ns | 4587342 ns | 1.26
…local constant.
…)`, `shuffle()`, `ranges::sample`, `ranges::shuffle`.
This was comparing `_Rem < _Index`, where `_Rem` is `_Udiff & _Udiff` (almost always `_Udiff`) while we have potentially signed `_Diff _Index`.
… __int64', signed/unsigned mismatch `_Urng::result_type` must be unsigned, but `result_type - result_type` performs the usual arithmetic conversions, promoting tiny types to int. Therefore, we need to `static_cast<_Udiff>`.
…quires a narrowing conversion Except for iterators with exotic difference types: For x64 algorithms, `_Diff` is naturally 64-bit, so `is_same_v<_Udiff, uint64_t>` is naturally taken. But for x86 algorithms, `_Diff` is naturally 32-bit. If the URBG is 32-bit (or smaller), then `_Udiff` will be `uint32_t`, and the `else` branch will be taken. In that case, the error correctly complains that we're using braces to convert from signed `_Diff _Index` to unsigned `_Uprod`, which is narrowing. We should directly `static_cast` instead.
Again, `(_Urng::max) () - (_Urng::min) ()` is `result_type - result_type`, subject to the usual arithmetic conversions. When compared against `_Udiff _Bmask_local`, this can emit signed/unsigned mismatch warnings. (This repro is x86-specific because MSVC emits warning C4018 at level 3 for `int < unsigned int`. `int < unsigned long long` emits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of `_Udiff` causes the warning to be x86-specific.) I'm avoiding extracting the `max - min` as a constant to avoid VSO-2580691 "`/analyze` emits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".
519a98d
to
01090a7
Compare
This comment was marked as resolved.
This comment was marked as resolved.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🦕
Left 2 small comments/questions, non-blocking.
using result_type = uint16_t; // N5014 [rand.req.urng]/3 | ||
static constexpr result_type min() { | ||
return 3; | ||
} | ||
static constexpr bool max() { | ||
return true; | ||
static constexpr result_type max() { | ||
return 1729; | ||
} | ||
bool operator()() & { | ||
return false; | ||
result_type operator()() & { | ||
return 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to be triple sure - these numbers are completely arbitrary, right? I'd be a fan of calling that out, but I won't insist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's a compile-time only test, and the actual values involved are not relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a note to add comments to this generator in a followup PR.
🗺️ Overview
Exactly 3 years ago, @MattStephanson's #3012 implemented @lemire's algorithm "Fast Random Integer Generation in an Interval" for TR1
uniform_int
and Standarduniform_int_distribution
, shipped in VS 2022 17.5.This PR extends the use of the algorithm to
sample()
,shuffle()
,ranges::sample
, andranges::shuffle
. It's a behavioral change (the results of sampling and shuffling will be different), but the significant speedup is worth it.We have to fix some warnings in
_Rng_from_urng_v2
because it was never previously exposed to signed difference types.My #5712 very recently split TR1
uniform_int
from Standarduniform_int_distribution
. The former still switches between my old_Rng_from_urng
and the new Lemire-powered_Rng_from_urng_v2
depending on whether it senses the presence ofstatic constexpr
min()
andmax()
from the engine, as Lemire's algorithm greatly benefits from having them as compile-time constants. After #5712, the Standard distribution now assumes it's being used with Standard engines (as it would be highly inconsistent for a program to be mixing the Standard distribution with a TR1 engine; both spellings should be updated simultaneously).As
shuffle()
was C++11 andsample()
was C++17, it is slightly more conceivable for those algorithms to be used with old TR1 engines. For the moment, to preserve source compatibility, I am using the same "switch between v1 and v2" logic that TR1uniform_int
uses (and that Standarduniform_int_distribution
indirectly used between #3012 and #5712).In the near future, I plan to purge TR1 entirely, making use of v2 unconditional, but I want to land this improvement separately.
⚙️ Commits
sample()
andshuffle()
._Rng_from_urng_v2
up to<algorithm>
, no other changes._Has_static_min_max
up to<algorithm>
, no other changes._Rng_from_urng_v1_or_v2
._Rng_from_urng_v2
:static constexpr
=>constexpr
for a local constant.min()
andmax()
must beconstexpr
.result_type
, must be unsigned integer._Rng_from_urng_v1_or_v2
to optimizesample()
,shuffle()
,ranges::sample
,ranges::shuffle
._Rem < _Index
, where_Rem
is_Udiff & _Udiff
(almost always_Udiff
) while we have potentially signed_Diff _Index
.'int'
to'unsigned __int64'
, signed/unsigned mismatch_Urng::result_type
must be unsigned, butresult_type - result_type
performs the usual arithmetic conversions, promoting tiny types toint
. Therefore, we need tostatic_cast<_Udiff>
.'const int'
to'unsigned __int64'
requires a narrowing conversion_Diff
is naturally 64-bit, sois_same_v<_Udiff, uint64_t>
is naturally taken._Diff
is naturally 32-bit. If the URBG is 32-bit (or smaller), then_Udiff
will beuint32_t
, and theelse
branch will be taken._Diff _Index
to unsigned_Uprod
, which is narrowing. We should directlystatic_cast
instead.(_Urng::max) () - (_Urng::min) ()
isresult_type - result_type
, subject to the usual arithmetic conversions. When compared against_Udiff _Bmask_local
, this can emit signed/unsigned mismatch warnings.int < unsigned int
.int < unsigned long long
emits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of_Udiff
causes the warning to be x86-specific.)max - min
as a constant to avoid VSO-2580691 "/analyze
emits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".⏱️ Benchmark results
On my 5950X:
bm_sample<uint8_t, alg_type::std_fn>/1048576/32768
bm_sample<uint16_t, alg_type::std_fn>/1048576/32768
bm_sample<uint32_t, alg_type::std_fn>/1048576/32768
bm_sample<uint64_t, alg_type::std_fn>/1048576/32768
bm_sample<uint8_t, alg_type::rng>/1048576/32768
bm_sample<uint16_t, alg_type::rng>/1048576/32768
bm_sample<uint32_t, alg_type::rng>/1048576/32768
bm_sample<uint64_t, alg_type::rng>/1048576/32768
bm_shuffle<uint8_t, alg_type::std_fn>/1048576
bm_shuffle<uint16_t, alg_type::std_fn>/1048576
bm_shuffle<uint32_t, alg_type::std_fn>/1048576
bm_shuffle<uint64_t, alg_type::std_fn>/1048576
bm_shuffle<uint8_t, alg_type::rng>/1048576
bm_shuffle<uint16_t, alg_type::rng>/1048576
bm_shuffle<uint32_t, alg_type::rng>/1048576
bm_shuffle<uint64_t, alg_type::rng>/1048576