`<algorithm>`: Optimize `sample()` and `shuffle()` with Lemire's algorithm #5735

StephanTLavavej · 2025-09-22T20:06:22Z

🗺️ Overview

Exactly 3 years ago, @MattStephanson's #3012 implemented @lemire's algorithm "Fast Random Integer Generation in an Interval" for TR1 uniform_int and Standard uniform_int_distribution, shipped in VS 2022 17.5.

This PR extends the use of the algorithm to sample(), shuffle(), ranges::sample, and ranges::shuffle. It's a behavioral change (the results of sampling and shuffling will be different), but the significant speedup is worth it.
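For readers unfamiliar with the technique, here is a minimal sketch of Lemire's multiply-and-reject method driving a Fisher-Yates shuffle. It is not the STL's `_Rng_from_urng_v2`: the names `bounded_lemire`, `shuffle_sketch`, and `shuffled_iota` are hypothetical, and the sketch assumes a generator whose range is exactly [0, 2^32 - 1] (such as `std::mt19937`) and a sequence shorter than 2^32 elements.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns a uniformly distributed value in [0, bound), bound > 0.
template <class Urbg32>
std::uint32_t bounded_lemire(Urbg32& gen, std::uint32_t bound) {
    static_assert(Urbg32::min() == 0 && Urbg32::max() == 0xFFFFFFFFu,
        "this sketch assumes a full-range 32-bit generator");
    std::uint64_t product = static_cast<std::uint64_t>(gen()) * bound;
    std::uint32_t low     = static_cast<std::uint32_t>(product);
    if (low < bound) { // rejection is only possible in this rare case
        const std::uint32_t threshold = (std::uint32_t{0} - bound) % bound; // 2^32 mod bound
        while (low < threshold) {
            product = static_cast<std::uint64_t>(gen()) * bound;
            low     = static_cast<std::uint32_t>(product);
        }
    }
    return static_cast<std::uint32_t>(product >> 32); // high half is the result
}

// Hypothetical Fisher-Yates shuffle built on the helper above.
// The iteration order of the real std::shuffle may differ.
template <class RanIt, class Urbg32>
void shuffle_sketch(RanIt first, RanIt last, Urbg32& gen) {
    for (auto remaining = last - first; remaining > 1; --remaining) {
        const auto pick = bounded_lemire(gen, static_cast<std::uint32_t>(remaining));
        std::iter_swap(first + (remaining - 1), first + static_cast<decltype(remaining)>(pick));
    }
}

// Convenience wrapper for experimentation: shuffles 0, 1, ..., count - 1.
inline std::vector<int> shuffled_iota(int count, std::uint32_t seed) {
    std::vector<int> values(static_cast<std::size_t>(count));
    std::iota(values.begin(), values.end(), 0);
    std::mt19937 gen(seed);
    shuffle_sketch(values.begin(), values.end(), gen);
    return values;
}
```

In the common case the method needs one multiplication and no division per draw; the modulo is evaluated only when the low half of the product falls below `bound`.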

We have to fix some warnings in _Rng_from_urng_v2 because it was never previously exposed to signed difference types.

My #5712 very recently split TR1 uniform_int from Standard uniform_int_distribution. The former still switches between my old _Rng_from_urng and the new Lemire-powered _Rng_from_urng_v2 depending on whether it senses the presence of static constexpr min() and max() from the engine, as Lemire's algorithm greatly benefits from having them as compile-time constants. After #5712, the Standard distribution now assumes it's being used with Standard engines (as it would be highly inconsistent for a program to be mixing the Standard distribution with a TR1 engine; both spellings should be updated simultaneously).
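The "senses the presence of static constexpr min() and max()" step can be illustrated with a small detection trait. This is a sketch under assumptions, not the STL's `_Has_static_min_max`, which may be written differently; `has_static_constexpr_min_max`, `tr1_style_engine`, and `picks_lemire_path` are hypothetical names.

```cpp
#include <random>
#include <type_traits>

// Primary template: assume the engine lacks static constexpr min()/max().
template <class Gen, class = void>
struct has_static_constexpr_min_max : std::false_type {};

// Selected only when Gen::min() and Gen::max() can be called without an object
// and used in a constant expression (here, as a template argument).
template <class Gen>
struct has_static_constexpr_min_max<Gen,
    std::void_t<decltype(std::bool_constant<(Gen::min() < Gen::max())>::value)>> : std::true_type {};

// Hypothetical TR1-style engine: min() and max() are non-static member functions.
struct tr1_style_engine {
    using result_type = unsigned int;
    result_type min() const { return 0; }
    result_type max() const { return 0xFFFFFFFFu; }
    result_type operator()() { return 0; }
};

// Hypothetical dispatch: reports which implementation a "v1 or v2" switch would pick.
template <class Gen>
constexpr bool picks_lemire_path() {
    if constexpr (has_static_constexpr_min_max<Gen>::value) {
        return true; // bounds are compile-time constants, so the Lemire-based path applies
    } else {
        return false; // fall back to the older implementation
    }
}

static_assert(picks_lemire_path<std::mt19937>(), "Standard engines expose static constexpr bounds");
static_assert(!picks_lemire_path<tr1_style_engine>(), "non-static bounds select the fallback");
```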

As shuffle() was C++11 and sample() was C++17, it is slightly more conceivable for those algorithms to be used with old TR1 engines. For the moment, to preserve source compatibility, I am using the same "switch between v1 and v2" logic that TR1 uniform_int uses (and that Standard uniform_int_distribution indirectly used between #3012 and #5712).

In the near future, I plan to purge TR1 entirely, making use of v2 unconditional, but I want to land this improvement separately.

⚙️ Commits

Add benchmarks for sample() and shuffle().
- I didn't observe interesting length dependence, so to reduce output verbosity, I selected a single length.
Move _Rng_from_urng_v2 up to <algorithm>, no other changes.
Move _Has_static_min_max up to <algorithm>, no other changes.
Extract _Rng_from_urng_v1_or_v2.
Cleanup _Rng_from_urng_v2: static constexpr => constexpr for a local constant.
Fix URBG usage: min() and max() must be constexpr.
Fix URBG usage: Must provide result_type, must be unsigned integer.
- Thanks @MattStephanson for bringing this squirrelly type to my attention. 🐿️
Behavioral change: Use _Rng_from_urng_v1_or_v2 to optimize sample(), shuffle(), ranges::sample, ranges::shuffle.
Fix warning C4018: '<': signed/unsigned mismatch
- This was comparing _Rem < _Index, where _Rem is _Udiff & _Udiff (almost always _Udiff) while we have potentially signed _Diff _Index.
Fix warning C4365: 'initializing': conversion from 'int' to 'unsigned __int64', signed/unsigned mismatch
- _Urng::result_type must be unsigned, but result_type - result_type performs the usual arithmetic conversions, promoting tiny types to int. Therefore, we need to static_cast<_Udiff>.
Fix error C2397: conversion from 'const int' to 'unsigned __int64' requires a narrowing conversion
- Except for iterators with exotic difference types:
- For x64 algorithms, _Diff is naturally 64-bit, so is_same_v<_Udiff, uint64_t> is naturally taken.
- But for x86 algorithms, _Diff is naturally 32-bit. If the URBG is 32-bit (or smaller), then _Udiff will be uint32_t, and the else branch will be taken.
- In that case, the error correctly complains that we're using braces to convert from signed _Diff _Index to unsigned _Uprod, which is narrowing. We should directly static_cast instead.
Fix warning C4018: '<': signed/unsigned mismatch
- Again, (_Urng::max) () - (_Urng::min) () is result_type - result_type, subject to the usual arithmetic conversions. When compared against _Udiff _Bmask_local, this can emit signed/unsigned mismatch warnings.
- (This repro is x86-specific because MSVC emits warning C4018 at level 3 for int < unsigned int. int < unsigned long long emits an off-by-default warning C4388 "signed/unsigned mismatch". So even though the warning behavior is architecture-neutral, the varying size of _Udiff causes the warning to be x86-specific.)
- I'm avoiding extracting the max - min as a constant to avoid VSO-2580691 "/analyze emits bogus warning C6295 (Loop executes infinitely) for a loop that immediately finishes".
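The signed/unsigned fixes above all stem from two language rules, which the following sketch isolates. It assumes a platform where `int` is wider than 16 bits; `urbg_range` and `widen_index` are hypothetical helpers, and the aliases merely stand in for the internal `_Udiff` and `_Uprod` types.

```cpp
#include <cstdint>
#include <type_traits>

// Rule 1: arithmetic on small unsigned types is performed in (signed) int.
static_assert(std::is_same_v<decltype(std::uint16_t{1729} - std::uint16_t{3}), int>,
    "uint16_t - uint16_t is promoted to int when int is wider than 16 bits");

// Hypothetical helper: computes max - min for an unsigned result_type and converts
// the possibly promoted (signed) difference back to an unsigned type explicitly.
template <class Udiff, class Result>
constexpr Udiff urbg_range(Result lowest, Result highest) {
    static_assert(std::is_unsigned_v<Result>, "URBG result_type must be unsigned");
    static_assert(std::is_unsigned_v<Udiff>, "the difference type used here must be unsigned");
    return static_cast<Udiff>(highest - lowest);
}

// Rule 2: list-initialization rejects narrowing conversions.
// Hypothetical helper: a signed index must be converted with static_cast.
inline std::uint64_t widen_index(std::int32_t index) {
    // std::uint64_t{index} would be ill-formed here: signed to unsigned is narrowing.
    return static_cast<std::uint64_t>(index);
}

static_assert(urbg_range<std::uint64_t, std::uint16_t>(3, 1729) == 1726,
    "the cast result compares cleanly against other unsigned values");
```

Casting at the point where the difference is formed keeps later comparisons unsigned on both sides, which is the situation the C4018 fixes aim for.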

⏱️ Benchmark results

On my 5950X:

Benchmark	Before	After	Speedup
`bm_sample<uint8_t, alg_type::std_fn>/1048576/32768`	4395155 ns	3719749 ns	1.18
`bm_sample<uint16_t, alg_type::std_fn>/1048576/32768`	4422082 ns	3740169 ns	1.18
`bm_sample<uint32_t, alg_type::std_fn>/1048576/32768`	4429234 ns	3545233 ns	1.25
`bm_sample<uint64_t, alg_type::std_fn>/1048576/32768`	4421113 ns	3564755 ns	1.24
`bm_sample<uint8_t, alg_type::rng>/1048576/32768`	4407065 ns	3926217 ns	1.12
`bm_sample<uint16_t, alg_type::rng>/1048576/32768`	4408346 ns	3531568 ns	1.25
`bm_sample<uint32_t, alg_type::rng>/1048576/32768`	4409194 ns	3527145 ns	1.25
`bm_sample<uint64_t, alg_type::rng>/1048576/32768`	4454799 ns	3556353 ns	1.25
---	---	---	---
`bm_shuffle<uint8_t, alg_type::std_fn>/1048576`	5750192 ns	3733152 ns	1.54
`bm_shuffle<uint16_t, alg_type::std_fn>/1048576`	5907846 ns	4186912 ns	1.41
`bm_shuffle<uint32_t, alg_type::std_fn>/1048576`	6043149 ns	4383343 ns	1.38
`bm_shuffle<uint64_t, alg_type::std_fn>/1048576`	6117446 ns	4582720 ns	1.33
`bm_shuffle<uint8_t, alg_type::rng>/1048576`	5701609 ns	3719451 ns	1.53
`bm_shuffle<uint16_t, alg_type::rng>/1048576`	5735311 ns	4229480 ns	1.36
`bm_shuffle<uint32_t, alg_type::rng>/1048576`	5771077 ns	4391066 ns	1.31
`bm_shuffle<uint64_t, alg_type::rng>/1048576`	5794527 ns	4587342 ns	1.26

AlexGuteniev · 2025-09-23T05:05:15Z

12th Gen Intel(R) Core(TM) i5-1235U (1.30 GHz)

P-cores

Benchmark	Before	After	Speedup
`bm_sample<uint8_t, alg_type::std_fn>/1048576/32768`	8139511 ns	2583362 ns	3.15
`bm_sample<uint16_t, alg_type::std_fn>/1048576/32768`	8189542 ns	2597715 ns	3.15
`bm_sample<uint32_t, alg_type::std_fn>/1048576/32768`	8110092 ns	2619239 ns	3.10
`bm_sample<uint64_t, alg_type::std_fn>/1048576/32768`	8338071 ns	2654214 ns	3.14
`bm_sample<uint8_t, alg_type::rng>/1048576/32768`	8152096 ns	2558043 ns	3.19
`bm_sample<uint16_t, alg_type::rng>/1048576/32768`	8073966 ns	2588999 ns	3.12
`bm_sample<uint32_t, alg_type::rng>/1048576/32768`	8122658 ns	2576719 ns	3.15
`bm_sample<uint64_t, alg_type::rng>/1048576/32768`	8325782 ns	2630247 ns	3.17
---
`bm_shuffle<uint8_t, alg_type::std_fn>/1048576`	5458182 ns	2460465 ns	2.22
`bm_shuffle<uint16_t, alg_type::std_fn>/1048576`	5535638 ns	3460889 ns	1.60
`bm_shuffle<uint32_t, alg_type::std_fn>/1048576`	7047519 ns	6212791 ns	1.13
`bm_shuffle<uint64_t, alg_type::std_fn>/1048576`	8753383 ns	8682911 ns	1.01
`bm_shuffle<uint8_t, alg_type::rng>/1048576`	5394413 ns	2463707 ns	2.19
`bm_shuffle<uint16_t, alg_type::rng>/1048576`	5604829 ns	3438364 ns	1.63
`bm_shuffle<uint32_t, alg_type::rng>/1048576`	7024632 ns	6182488 ns	1.14
`bm_shuffle<uint64_t, alg_type::rng>/1048576`	8673617 ns	8852352 ns	0.98

E-cores

Benchmark	Before	After	Speedup
`bm_sample<uint8_t, alg_type::std_fn>/1048576/32768`	17235590 ns	4732827 ns	3.64
`bm_sample<uint16_t, alg_type::std_fn>/1048576/32768`	17221927 ns	4785911 ns	3.60
`bm_sample<uint32_t, alg_type::std_fn>/1048576/32768`	17260816 ns	4856078 ns	3.55
`bm_sample<uint64_t, alg_type::std_fn>/1048576/32768`	17377549 ns	5002572 ns	3.47
`bm_sample<uint8_t, alg_type::rng>/1048576/32768`	17247793 ns	4738026 ns	3.64
`bm_sample<uint16_t, alg_type::rng>/1048576/32768`	17265388 ns	4938240 ns	3.50
`bm_sample<uint32_t, alg_type::rng>/1048576/32768`	17296810 ns	4758713 ns	3.63
`bm_sample<uint64_t, alg_type::rng>/1048576/32768`	17392961 ns	5018447 ns	3.47
---
`bm_shuffle<uint8_t, alg_type::std_fn>/1048576`	11490609 ns	5384542 ns	2.13
`bm_shuffle<uint16_t, alg_type::std_fn>/1048576`	11678696 ns	6111882 ns	1.91
`bm_shuffle<uint32_t, alg_type::std_fn>/1048576`	17002617 ns	14125616 ns	1.20
`bm_shuffle<uint64_t, alg_type::std_fn>/1048576`	24842196 ns	23817872 ns	1.04
`bm_shuffle<uint8_t, alg_type::rng>/1048576`	11482229 ns	5481054 ns	2.09
`bm_shuffle<uint16_t, alg_type::rng>/1048576`	11877893 ns	6066280 ns	1.96
`bm_shuffle<uint32_t, alg_type::rng>/1048576`	16942518 ns	13997628 ns	1.21
`bm_shuffle<uint64_t, alg_type::rng>/1048576`	24312082 ns	24257977 ns	1.00

StephanTLavavej · 2025-09-25T19:21:49Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

davidmrdavid

LGTM 🦕

Left 2 small comments/questions, non-blocking.

davidmrdavid · 2025-09-25T21:53:01Z

tests/std/tests/P0896R4_ranges_alg_shuffle/test.cpp

+        using result_type = uint16_t; // N5014 [rand.req.urng]/3
+        static constexpr result_type min() {
+            return 3;
        }
-        static constexpr bool max() {
-            return true;
+        static constexpr result_type max() {
+            return 1729;
        }
-        bool operator()() & {
-            return false;
+        result_type operator()() & {
+            return 4;


Just to be triple sure - these numbers are completely arbitrary, right? I'd be a fan of calling that out, but I won't insist.

Yes, it's a compile-time only test, and the actual values involved are not relevant.

I've made a note to add comments to this generator in a followup PR.
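As a standalone illustration of what the updated test generator has to provide, the sketch below restates it with compile-time checks. `toy_urbg` is a hypothetical name, the values are as arbitrary as in the test, and the checks cover only the properties discussed in this PR (unsigned `result_type`, constant `min()` and `max()`), not the full requirements.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical generator mirroring the shape of the one in the test.
// It always returns the same value, so it is only suitable for compile-time checks;
// do not pass it to an algorithm that actually draws numbers at run time.
struct toy_urbg {
    using result_type = std::uint16_t; // must be an unsigned integer type
    static constexpr result_type min() {
        return 3; // arbitrary
    }
    static constexpr result_type max() {
        return 1729; // arbitrary, but greater than min()
    }
    result_type operator()() & {
        return 4; // arbitrary value within [min(), max()]
    }
};

static_assert(std::is_unsigned_v<toy_urbg::result_type>, "result_type must be unsigned");
static_assert(toy_urbg::min() < toy_urbg::max(), "bounds must be usable as constants");
static_assert(std::is_same_v<std::invoke_result_t<toy_urbg&>, toy_urbg::result_type>,
    "calling an lvalue generator yields result_type");
static_assert(!std::is_invocable_v<toy_urbg>, "the & qualifier rejects rvalue calls");
```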

stl/inc/algorithm

StephanTLavavej requested a review from a team as a code owner September 22, 2025 20:06

StephanTLavavej added the performance Must go faster label Sep 22, 2025

github-project-automation bot added this to STL Code Reviews Sep 22, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Sep 22, 2025

StephanTLavavej moved this from Initial Review to Final Review in STL Code Reviews Sep 22, 2025

StephanTLavavej mentioned this pull request Sep 22, 2025

<algorithm>: Investigate further optimizations for shuffle() and sample() #5736

Open

StephanTLavavej added 12 commits September 23, 2025 04:31

Move _Rng_from_urng_v2 up to <algorithm>, no other changes.

8c0cbe6

Move _Has_static_min_max up to <algorithm>, no other changes.

5fa2951

Extract _Rng_from_urng_v1_or_v2.

30f3317

Cleanup _Rng_from_urng_v2: static constexpr => constexpr for a …

e62b977

…local constant.

Fix URBG usage: min() and max() must be constexpr.

ae1fa8a

Fix URBG usage: Must provide result_type, must be unsigned integer.

1829230

Behavioral change: Use _Rng_from_urng_v1_or_v2 to optimize `sample(…

ca969a9

…)`, `shuffle()`, `ranges::sample`, `ranges::shuffle`.

Fix warning C4018: '<': signed/unsigned mismatch

f0c8420

This was comparing `_Rem < _Index`, where `_Rem` is `_Udiff & _Udiff` (almost always `_Udiff`) while we have potentially signed `_Diff _Index`.

StephanTLavavej force-pushed the shuffle-your-library branch from 519a98d to 01090a7 Compare September 23, 2025 12:55

davidmrdavid approved these changes Sep 25, 2025

StephanTLavavej moved this from Final Review to Merging in STL Code Reviews Sep 25, 2025

StephanTLavavej merged commit 5913185 into microsoft:main Sep 25, 2025
39 checks passed

StephanTLavavej deleted the shuffle-your-library branch September 25, 2025 22:35

github-project-automation bot moved this from Merging to Done in STL Code Reviews Sep 25, 2025
