@StephanTLavavej StephanTLavavej commented Sep 22, 2025

🗺️ Overview

Exactly 3 years ago, @MattStephanson's #3012 implemented @lemire's algorithm "Fast Random Integer Generation in an Interval" for TR1 uniform_int and Standard uniform_int_distribution, shipped in VS 2022 17.5.

This PR extends the use of the algorithm to sample(), shuffle(), ranges::sample, and ranges::shuffle. It's a behavioral change (the results of sampling and shuffling will be different), but the significant speedup is worth it.
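For readers unfamiliar with the technique, here is a minimal sketch of Lemire's multiply-and-reject idea for a 32-bit bound. It is not the STL's `_Rng_from_urng_v2`: the name `bounded_lemire` is hypothetical, and the sketch assumes a generator whose range is exactly [0, 2^32 - 1] (which holds for `std::mt19937`).

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper, not the STL's internal machinery.
// Returns a uniformly distributed value in [0, bound), assuming gen() yields
// uniformly distributed values over the full 32-bit range.
template <class Gen>
std::uint32_t bounded_lemire(Gen& gen, std::uint32_t bound) {
    static_assert(Gen::min() == 0 && Gen::max() == 0xFFFFFFFFu,
        "this sketch assumes a full-range 32-bit generator");
    assert(bound != 0);

    // Multiply a 32-bit random value by the bound; the high 32 bits of the
    // 64-bit product are the candidate result, the low 32 bits decide rejection.
    std::uint64_t product = static_cast<std::uint64_t>(gen()) * bound;
    auto low              = static_cast<std::uint32_t>(product);

    if (low < bound) { // rare path: only here do we pay for a modulo
        const std::uint32_t threshold = (0u - bound) % bound; // 2^32 mod bound
        while (low < threshold) { // reject to remove bias, then retry
            product = static_cast<std::uint64_t>(gen()) * bound;
            low     = static_cast<std::uint32_t>(product);
        }
    }
    return static_cast<std::uint32_t>(product >> 32);
}
```

With `std::mt19937 gen(42);`, a call such as `bounded_lemire(gen, 10)` returns a value in [0, 9]. The modulo is evaluated only when `low < bound`, which is where the speedup over a division-per-call approach comes from.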

We have to fix some warnings in _Rng_from_urng_v2 because it was never previously exposed to signed difference types.

My #5712 very recently split TR1 uniform_int from Standard uniform_int_distribution. The former still switches between my old _Rng_from_urng and the new Lemire-powered _Rng_from_urng_v2 depending on whether it senses the presence of static constexpr min() and max() from the engine, as Lemire's algorithm greatly benefits from having them as compile-time constants. After #5712, the Standard distribution now assumes it's being used with Standard engines (as it would be highly inconsistent for a program to be mixing the Standard distribution with a TR1 engine; both spellings should be updated simultaneously).

As shuffle() was C++11 and sample() was C++17, it is slightly more conceivable for those algorithms to be used with old TR1 engines. For the moment, to preserve source compatibility, I am using the same "switch between v1 and v2" logic that TR1 uniform_int uses (and that Standard uniform_int_distribution indirectly used between #3012 and #5712).
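The "switch between v1 and v2" decision can be sketched as a detection trait plus `if constexpr`. The names below (`has_static_constexpr_min_max`, `legacy_engine`, `chosen_version`) are hypothetical and only illustrate the idea; the STL's actual `_Has_static_min_max` and `_Rng_from_urng_v1_or_v2` may differ in detail.

```cpp
#include <random>
#include <type_traits>

// Hypothetical trait: true only when Gen::min() and Gen::max() can be called
// without an object AND are usable in a constant expression.
template <class Gen, class = void>
struct has_static_constexpr_min_max : std::false_type {};

template <class Gen>
struct has_static_constexpr_min_max<Gen,
    std::void_t<decltype(std::bool_constant<(Gen::min() < Gen::max())>::value)>>
    : std::true_type {};

// Stand-in for a TR1-style engine: min() and max() are non-static members.
struct legacy_engine {
    unsigned int min() const { return 0; }
    unsigned int max() const { return 0xFFFFFFFFu; }
    unsigned int operator()() { return 0; }
};

// Hypothetical dispatcher: reports which code path would be selected.
template <class Gen>
constexpr int chosen_version() {
    if constexpr (has_static_constexpr_min_max<Gen>::value) {
        return 2; // compile-time min/max available: Lemire-style path
    } else {
        return 1; // fall back to the older path
    }
}

static_assert(chosen_version<std::mt19937>() == 2);
static_assert(chosen_version<legacy_engine>() == 1);
```

Because the selection happens at compile time, code using Standard engines pays nothing for the fallback, and code using older engines keeps compiling.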

In the near future, I plan to purge TR1 entirely, making use of v2 unconditional, but I want to land this improvement separately.

⚙️ Commits

  • Add benchmarks for sample() and shuffle().
    • I didn't observe interesting length dependence, so to reduce output verbosity, I selected a single length.
  • Move _Rng_from_urng_v2 up to <algorithm>, no other changes.
  • Move _Has_static_min_max up to <algorithm>, no other changes.
  • Extract _Rng_from_urng_v1_or_v2.
  • Cleanup _Rng_from_urng_v2: static constexpr => constexpr for a local constant.
  • Fix URBG usage: min() and max() must be constexpr.
  • Fix URBG usage: Must provide result_type, must be unsigned integer.
    • Thanks @MattStephanson for bringing this squirrelly type to my attention. 🐿️
  • Behavioral change: Use _Rng_from_urng_v1_or_v2 to optimize sample(), shuffle(), ranges::sample, ranges::shuffle.
  • Fix warning C4018: '<': signed/unsigned mismatch
    • This was comparing _Rem < _Index, where _Rem is _Udiff & _Udiff (almost always _Udiff) while we have potentially signed _Diff _Index.
  • Fix warning C4365: 'initializing': conversion from 'int' to 'unsigned __int64', signed/unsigned mismatch
    • _Urng::result_type must be unsigned, but result_type - result_type performs the usual arithmetic conversions, promoting tiny types to int. Therefore, we need to static_cast<_Udiff>.
  • Fix error C2397: conversion from 'const int' to 'unsigned __int64' requires a narrowing conversion
    • Except for iterators with exotic difference types:
    • For x64 algorithms, _Diff is naturally 64-bit, so is_same_v<_Udiff, uint64_t> is naturally taken.
    • But for x86 algorithms, _Diff is naturally 32-bit. If the URBG is 32-bit (or smaller), then _Udiff will be uint32_t, and the else branch will be taken.
    • In that case, the error correctly complains that we're using braces to convert from signed _Diff _Index to unsigned _Uprod, which is narrowing. We should directly static_cast instead.
  • Fix warning C4018: '<': signed/unsigned mismatch
    • Again, (_Urng::max) () - (_Urng::min) () is result_type - result_type, subject to the usual arithmetic conversions. When compared against _Udiff _Bmask_local, this can emit signed/unsigned mismatch warnings.
    • (This repro is x86-specific because MSVC emits warning C4018 at level 3 for int < unsigned int. int < unsigned long long emits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of _Udiff causes the warning to be x86-specific.)
    • I'm avoiding extracting the max - min as a constant to avoid VSO-2580691 "/analyze emits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".
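The promotion and narrowing issues behind these warning fixes can be reproduced in isolation. This is an illustrative sketch only; `range_width` and `to_product_type` are hypothetical helpers, not functions from the STL.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Operands narrower than int are promoted to int before the subtraction,
// even when both operands are unsigned.
static_assert(std::is_same_v<decltype(std::uint16_t{} - std::uint16_t{}), int>);

// Hypothetical helper: computes max_value - min_value (an int for tiny types)
// and converts the result explicitly, so later comparisons against an
// unsigned value do not mix signedness.
template <class Udiff, class T>
constexpr Udiff range_width(T max_value, T min_value) {
    static_assert(std::is_unsigned_v<Udiff>, "Udiff is expected to be unsigned");
    return static_cast<Udiff>(max_value - min_value);
}

// Hypothetical helper: converts a possibly signed index to an unsigned
// product type. Brace-initialization (Uprod{index}) would be rejected as a
// narrowing conversion when Diff is signed, so static_cast is used instead.
template <class Uprod, class Diff>
Uprod to_product_type(Diff index) {
    assert(index >= 0); // indices passed to the generator are non-negative
    return static_cast<Uprod>(index);
}
```

For example, `range_width<std::uint64_t>(std::uint16_t{1729}, std::uint16_t{3})` yields 1726 as a `std::uint64_t`, with the signed intermediate confined to a single explicit cast.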

⏱️ Benchmark results

On my 5950X:

Benchmark                                             |     Before |      After | Speedup
------------------------------------------------------|------------|------------|--------
`bm_sample<uint8_t, alg_type::std_fn>/1048576/32768`  | 4395155 ns | 3719749 ns | 1.18
`bm_sample<uint16_t, alg_type::std_fn>/1048576/32768` | 4422082 ns | 3740169 ns | 1.18
`bm_sample<uint32_t, alg_type::std_fn>/1048576/32768` | 4429234 ns | 3545233 ns | 1.25
`bm_sample<uint64_t, alg_type::std_fn>/1048576/32768` | 4421113 ns | 3564755 ns | 1.24
`bm_sample<uint8_t, alg_type::rng>/1048576/32768`     | 4407065 ns | 3926217 ns | 1.12
`bm_sample<uint16_t, alg_type::rng>/1048576/32768`    | 4408346 ns | 3531568 ns | 1.25
`bm_sample<uint32_t, alg_type::rng>/1048576/32768`    | 4409194 ns | 3527145 ns | 1.25
`bm_sample<uint64_t, alg_type::rng>/1048576/32768`    | 4454799 ns | 3556353 ns | 1.25
------------------------------------------------------|------------|------------|-----
`bm_shuffle<uint8_t, alg_type::std_fn>/1048576`       | 5750192 ns | 3733152 ns | 1.54
`bm_shuffle<uint16_t, alg_type::std_fn>/1048576`      | 5907846 ns | 4186912 ns | 1.41
`bm_shuffle<uint32_t, alg_type::std_fn>/1048576`      | 6043149 ns | 4383343 ns | 1.38
`bm_shuffle<uint64_t, alg_type::std_fn>/1048576`      | 6117446 ns | 4582720 ns | 1.33
`bm_shuffle<uint8_t, alg_type::rng>/1048576`          | 5701609 ns | 3719451 ns | 1.53
`bm_shuffle<uint16_t, alg_type::rng>/1048576`         | 5735311 ns | 4229480 ns | 1.36
`bm_shuffle<uint32_t, alg_type::rng>/1048576`         | 5771077 ns | 4391066 ns | 1.31
`bm_shuffle<uint64_t, alg_type::rng>/1048576`         | 5794527 ns | 4587342 ns | 1.26


AlexGuteniev commented Sep 23, 2025

12th Gen Intel(R) Core(TM) i5-1235U (1.30 GHz)

P-cores

Benchmark                                             |     Before |      After | Speedup
------------------------------------------------------|------------|------------|--------
`bm_sample<uint8_t, alg_type::std_fn>/1048576/32768`  | 8139511 ns | 2583362 ns | 3.15
`bm_sample<uint16_t, alg_type::std_fn>/1048576/32768` | 8189542 ns | 2597715 ns | 3.15
`bm_sample<uint32_t, alg_type::std_fn>/1048576/32768` | 8110092 ns | 2619239 ns | 3.10
`bm_sample<uint64_t, alg_type::std_fn>/1048576/32768` | 8338071 ns | 2654214 ns | 3.14
`bm_sample<uint8_t, alg_type::rng>/1048576/32768`     | 8152096 ns | 2558043 ns | 3.19
`bm_sample<uint16_t, alg_type::rng>/1048576/32768`    | 8073966 ns | 2588999 ns | 3.12
`bm_sample<uint32_t, alg_type::rng>/1048576/32768`    | 8122658 ns | 2576719 ns | 3.15
`bm_sample<uint64_t, alg_type::rng>/1048576/32768`    | 8325782 ns | 2630247 ns | 3.17
------------------------------------------------------|------------|------------|-----
`bm_shuffle<uint8_t, alg_type::std_fn>/1048576`       | 5458182 ns | 2460465 ns | 2.22
`bm_shuffle<uint16_t, alg_type::std_fn>/1048576`      | 5535638 ns | 3460889 ns | 1.60
`bm_shuffle<uint32_t, alg_type::std_fn>/1048576`      | 7047519 ns | 6212791 ns | 1.13
`bm_shuffle<uint64_t, alg_type::std_fn>/1048576`      | 8753383 ns | 8682911 ns | 1.01
`bm_shuffle<uint8_t, alg_type::rng>/1048576`          | 5394413 ns | 2463707 ns | 2.19
`bm_shuffle<uint16_t, alg_type::rng>/1048576`         | 5604829 ns | 3438364 ns | 1.63
`bm_shuffle<uint32_t, alg_type::rng>/1048576`         | 7024632 ns | 6182488 ns | 1.14
`bm_shuffle<uint64_t, alg_type::rng>/1048576`         | 8673617 ns | 8852352 ns | 0.98

E-cores

Benchmark                                             |      Before |       After | Speedup
------------------------------------------------------|-------------|-------------|--------
`bm_sample<uint8_t, alg_type::std_fn>/1048576/32768`  | 17235590 ns |  4732827 ns | 3.64
`bm_sample<uint16_t, alg_type::std_fn>/1048576/32768` | 17221927 ns |  4785911 ns | 3.60
`bm_sample<uint32_t, alg_type::std_fn>/1048576/32768` | 17260816 ns |  4856078 ns | 3.55
`bm_sample<uint64_t, alg_type::std_fn>/1048576/32768` | 17377549 ns |  5002572 ns | 3.47
`bm_sample<uint8_t, alg_type::rng>/1048576/32768`     | 17247793 ns |  4738026 ns | 3.64
`bm_sample<uint16_t, alg_type::rng>/1048576/32768`    | 17265388 ns |  4938240 ns | 3.50
`bm_sample<uint32_t, alg_type::rng>/1048576/32768`    | 17296810 ns |  4758713 ns | 3.63
`bm_sample<uint64_t, alg_type::rng>/1048576/32768`    | 17392961 ns |  5018447 ns | 3.47
------------------------------------------------------|-------------|-------------|-----
`bm_shuffle<uint8_t, alg_type::std_fn>/1048576`       | 11490609 ns |  5384542 ns | 2.13
`bm_shuffle<uint16_t, alg_type::std_fn>/1048576`      | 11678696 ns |  6111882 ns | 1.91
`bm_shuffle<uint32_t, alg_type::std_fn>/1048576`      | 17002617 ns | 14125616 ns | 1.20
`bm_shuffle<uint64_t, alg_type::std_fn>/1048576`      | 24842196 ns | 23817872 ns | 1.04
`bm_shuffle<uint8_t, alg_type::rng>/1048576`          | 11482229 ns |  5481054 ns | 2.09
`bm_shuffle<uint16_t, alg_type::rng>/1048576`         | 11877893 ns |  6066280 ns | 1.96
`bm_shuffle<uint32_t, alg_type::rng>/1048576`         | 16942518 ns | 13997628 ns | 1.21
`bm_shuffle<uint64_t, alg_type::rng>/1048576`         | 24312082 ns | 24257977 ns | 1.00


@StephanTLavavej
Member Author

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

Member

@davidmrdavid davidmrdavid left a comment

LGTM 🦕

Left 2 small comments/questions, non-blocking.

Comment on lines +57 to +65
      using result_type = uint16_t; // N5014 [rand.req.urng]/3
      static constexpr result_type min() {
          return 3;
      }
    - static constexpr bool max() {
    -     return true;
    + static constexpr result_type max() {
    +     return 1729;
      }
    - bool operator()() & {
    -     return false;
    + result_type operator()() & {
    +     return 4;
Member

just to be triple sure - these numbers are completely arbitrary, right? I'd be a fan of calling that out, but I won't insist.

Member Author

Yes, it's a compile-time only test, and the actual values involved are not relevant.

Member Author

I've made a note to add comments to this generator in a followup PR.
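As a rough illustration of what such comments might say, here is a self-contained sketch of a compile-time-only generator in the same shape. The name `arbitrary_urbg` is hypothetical, and the checks are spelled with C++17 type traits rather than the C++20 `std::uniform_random_bit_generator` concept so the sketch builds in either mode.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical compile-time-only generator. The constants are arbitrary:
// only the types and the constexpr-ness of min()/max() matter.
struct arbitrary_urbg {
    using result_type = std::uint16_t; // must be an unsigned integer type

    static constexpr result_type min() { return 3; } // arbitrary lower bound
    static constexpr result_type max() { return 1729; } // arbitrary upper bound

    result_type operator()() & { return 4; } // arbitrary value in [min(), max()]
};

// The properties the algorithms rely on, checked at compile time.
static_assert(std::is_unsigned_v<arbitrary_urbg::result_type>);
static_assert(std::is_same_v<std::invoke_result_t<arbitrary_urbg&>, arbitrary_urbg::result_type>);
static_assert(arbitrary_urbg::min() < arbitrary_urbg::max());

// Runtime sanity helper: the produced value stays within the advertised range.
inline bool stays_in_range(arbitrary_urbg& gen) {
    const auto value = gen();
    return arbitrary_urbg::min() <= value && value <= arbitrary_urbg::max();
}
```

A generator like this is not random at all, so it should only be used to check that code compiles, not to drive an actual shuffle or sample.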

@StephanTLavavej StephanTLavavej moved this from Final Review to Merging in STL Code Reviews Sep 25, 2025
@StephanTLavavej StephanTLavavej merged commit 5913185 into microsoft:main Sep 25, 2025
39 checks passed
@StephanTLavavej StephanTLavavej deleted the shuffle-your-library branch September 25, 2025 22:35
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Sep 25, 2025
Labels
performance Must go faster