AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented May 18, 2025

Follow-up to #5502.

Reasons to consider a follow-up:

  • Some cases show no improvement
  • After thinking about it, I discovered that some cases even regress slightly

Weakness of the original approach

It deals well with extremes:

  • Small rotation
  • Rotation close to the middle

The worst case is when the rotation is small, but still large enough to not engage the small rotation branch.

Mitigation approaches

Generally, we need a multi-range rotating swap to make fewer element assignments. From the original PR:

A hypothetical functions like swap_3_ranges, swap_4_ranges, etc could reduce the number of assignments for more cases. But going further in optimization will result in less and less improvement for more and more code added, and at some point will cause the complex decisions to take noticeable amount of time, resulting in negative improvement, so we need to stop somewhere. Probably stopping on just small rotation and two ranges swap strategy would be a good idea.

So how can we get some improvement while avoiding unnecessary complication? The options:

  • Implement a few of swap_3_ranges, swap_4_ranges, etc., but no more than two of them, as separate functions.
    • ❌ Will not squeeze away unnecessary assignments too hard
    • ✅ This should be the easiest thing to do
  • Spawn many swap_N_ranges from a single source using metaprogramming, and pick the best one at runtime
    • ✅ Will squeeze away unnecessary assignments harder
    • ❌ Needs complex metaprogramming, using template fold expressions or macros
    • ❌ A template implementation would rely heavily on compiler optimization
    • 🤮 Macro implementation is just not a good thing
    • ❌ Will add a lot of machine code, which will make binary bigger, and there will be a lot of "cold" code at runtime
  • Implement a single swap_N_ranges that works with a number of ranges chosen at runtime
    • ✅ Will squeeze away unnecessary assignments to the maximum
    • ❓ Swapping too many ranges at the same time is likely to break prefetch, need to see the impact of that
    • ❓ Will have additional runtime cost for looping over the set of per-range pointers on every step
    • ⚠️ A large power-of-two stride will cause cache conflict evictions if N exceeds the CPU cache associativity, which would result in dramatic performance degradation
    • ❌ Will have complex flow at runtime

This makes me think that it would be good to:

  • First try the simple approach of having one or two additional swap functions
  • If there's a strong indication of success in this direction, try the runtime-variable swap_N_ranges
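To make the cache-associativity concern from the list above concrete, here is a rough sketch of how addresses map to cache sets. The cache geometry (32 KiB, 8-way, 64-byte lines) and the names `cache_set` and `distinct_sets` are assumptions for illustration only; they are not taken from this PR, and real caches have further details that this simplified model ignores.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache geometry for illustration (not taken from the PR):
// 32 KiB, 8-way set associative, 64-byte lines.
constexpr std::size_t cache_bytes = 32 * 1024;
constexpr std::size_t ways        = 8;
constexpr std::size_t line_bytes  = 64;
constexpr std::size_t sets        = cache_bytes / (ways * line_bytes); // 64 sets

// Simplified model: the set is selected by the line number modulo the set count.
constexpr std::size_t cache_set(std::uintptr_t address) {
    return (address / line_bytes) % sets;
}

// Counts how many distinct sets are hit by the first line of each of
// n_ranges ranges whose starts are stride_bytes apart (hypothetical helper).
std::size_t distinct_sets(std::size_t n_ranges, std::size_t stride_bytes) {
    bool seen[sets]   = {};
    std::size_t count = 0;
    for (std::size_t i = 0; i < n_ranges; ++i) {
        const std::size_t s = cache_set(i * stride_bytes);
        if (!seen[s]) {
            seen[s] = true;
            ++count;
        }
    }
    return count;
}
```

Under these assumed numbers, `distinct_sets(16, 4096)` is 1: all 16 range starts compete for the 8 ways of a single set, so walking them in lockstep keeps evicting lines that are about to be reused. With a stride of 4096 + 64 bytes the same 16 ranges spread over 16 different sets.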

The code changes

So I've tried _Swap_3_ranges. It resulted in a speedup of up to roughly 1.4x and fixed the slightly regressed cases.
I think this indicates both that the approach is good enough to use and that the gain is not large enough to justify trying something more complex.

I've moved _Rotating closer to __std_swap_ranges_trivially_swappable_noalias to make the similarity between that and _Swap_3_ranges more obvious.
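For readers who want the idea without opening the diff, here is a minimal scalar sketch of a three-range rotating swap. It is not the actual _Swap_3_ranges from this PR: the name `swap_3_ranges_sketch`, its signature, and the rotation direction are assumptions for illustration, and it makes no attempt at vectorization.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical illustration, not the STL's _Swap_3_ranges.
// Cyclically exchanges three equal-length, non-overlapping ranges:
// range 1 receives range 2, range 2 receives range 3,
// and range 3 receives the old contents of range 1.
template <class T>
void swap_3_ranges_sketch(T* first1, T* last1, T* first2, T* first3) {
    for (; first1 != last1; ++first1, ++first2, ++first3) {
        T tmp   = std::move(*first1); // 1 move out
        *first1 = std::move(*first2); // 3 moves to place the elements
        *first2 = std::move(*first3);
        *first3 = std::move(tmp);
    }
}
```

Each step costs 4 moves per element triple. Getting the same permutation from two successive two-range swaps costs 3 moves each, so 6 in total, which is where the saving in element assignments comes from. When the three ranges are adjacent blocks of one array, the result equals rotating that array left by one block.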

Coverage

The tests lacked arrays long enough to exercise the range swapping properly. I've expanded the test to have more elements; to save some run time, I did this for 8-bit elements only. The algorithm does not distinguish element sizes internally anyway.

The same goes for the benchmark: I've added just two examples of the case that became worse.
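The kind of coverage described above can be sketched as a check of std::rotate against a straightforward copy-based reference, for every rotation point of a long 8-bit array. This is only an illustration, not the test added in this PR; `rotated_copy`, `rotate_matches_reference`, and the fixed seed are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Reference result: built by copying the two halves, with no in-place tricks.
std::vector<std::uint8_t> rotated_copy(const std::vector<std::uint8_t>& in, std::size_t mid) {
    const auto split = in.begin() + static_cast<std::ptrdiff_t>(mid);
    std::vector<std::uint8_t> out;
    out.reserve(in.size());
    out.insert(out.end(), split, in.end());
    out.insert(out.end(), in.begin(), split);
    return out;
}

// Returns true if std::rotate agrees with the reference for every rotation point.
bool rotate_matches_reference(std::size_t size) {
    std::mt19937 gen(12345); // fixed seed so failures are reproducible
    std::vector<std::uint8_t> original(size);
    for (auto& value : original) {
        value = static_cast<std::uint8_t>(gen());
    }

    for (std::size_t mid = 0; mid <= size; ++mid) {
        std::vector<std::uint8_t> actual = original;
        std::rotate(actual.begin(), actual.begin() + static_cast<std::ptrdiff_t>(mid), actual.end());
        if (actual != rotated_copy(original, mid)) {
            return false;
        }
    }
    return true;
}
```

Pseudo-random fill is used rather than a counting pattern because 8-bit counters repeat every 256 elements, which could hide a rotation that is off by a multiple of 256.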

Benchmark results

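The "Before #5502" / "After #5502" numbers may vary slightly from the previous PR description, because I ran the benchmarks again. The ⬆️ columns are speedup ratios: "#5502 ⬆️" is Before #5502 divided by After #5502, "This ⬆️" is After #5502 divided by After this, and "Total ⬆️" is Before #5502 divided by After this.

Benchmark | Before #5502 | After #5502 | After this | #5502 ⬆️ | This ⬆️ | Total ⬆️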
u8//Std/3333/2242 93.8 ns 67.0 ns 50.6 ns 1.40 1.32 1.85
u8//Std/3332/1666 94.6 ns 40.0 ns 40.5 ns 2.37 0.99 2.34
u8//Std/3333/1111 91.4 ns 60.4 ns 44.0 ns 1.51 1.37 2.08
u8//Std/3333/501 89.9 ns 32.1 ns 32.1 ns 2.80 1.00 2.80
u8//Std/3333/3300 91.3 ns 32.3 ns 32.1 ns 2.83 1.01 2.84
u8//Std/3333/12 87.8 ns 25.9 ns 25.8 ns 3.39 1.00 3.40
u8//Std/3333/5 90.8 ns 29.0 ns 28.8 ns 3.13 1.01 3.15
u8//Std/3333/1 82.2 ns 28.8 ns 29.8 ns 2.85 0.97 2.76
u8//Std/333/101 19.0 ns 12.1 ns 10.1 ns 1.57 1.20 1.88
u8//Std/123/32 22.7 ns 6.57 ns 6.36 ns 3.46 1.03 3.57
u8//Std/23/7 18.3 ns 5.24 ns 5.51 ns 3.49 0.95 3.32
u8//Std/12/5 12.9 ns 5.26 ns 5.03 ns 2.45 1.05 2.56
u8//Std/3/2 3.42 ns 4.77 ns 4.71 ns 0.72 1.01 0.73
u8//Rng/3333/2242 94.3 ns 67.4 ns 52.8 ns 1.40 1.28 1.79
u8//Rng/3332/1666 95.9 ns 39.9 ns 41.7 ns 2.40 0.96 2.30
u8//Rng/3333/1111 93.2 ns 58.4 ns 45.9 ns 1.60 1.27 2.03
u8//Rng/3333/501 89.8 ns 31.9 ns 32.3 ns 2.82 0.99 2.78
u8//Rng/3333/3300 93.5 ns 32.5 ns 33.3 ns 2.88 0.98 2.81
u8//Rng/3333/12 89.3 ns 25.9 ns 26.0 ns 3.45 1.00 3.43
u8//Rng/3333/5 87.4 ns 29.0 ns 29.2 ns 3.01 0.99 2.99
u8//Rng/3333/1 83.1 ns 29.0 ns 28.9 ns 2.87 1.00 2.88
u8//Rng/333/101 18.4 ns 12.1 ns 11.3 ns 1.52 1.07 1.63
u8//Rng/123/32 26.1 ns 6.56 ns 6.44 ns 3.98 1.02 4.05
u8//Rng/23/7 18.5 ns 5.20 ns 5.22 ns 3.56 1.00 3.54
u8//Rng/12/5 13.2 ns 5.28 ns 4.93 ns 2.50 1.07 2.68
u8//Rng/3/2 3.33 ns 4.77 ns 4.73 ns 0.70 1.01 0.70
u16//Std/3333/2242 180 ns 131 ns 106 ns 1.37 1.24 1.70
u16//Std/3332/1666 184 ns 84.0 ns 83.6 ns 2.19 1.00 2.20
u16//Std/3333/1111 185 ns 132 ns 86.5 ns 1.40 1.53 2.14
u16//Std/3333/501 184 ns 170 ns 143 ns 1.08 1.19 1.29
u16//Std/3333/3300 179 ns 61.9 ns 61.7 ns 2.89 1.00 2.90
u16//Std/3333/12 166 ns 46.8 ns 46.3 ns 3.55 1.01 3.59
u16//Std/3333/5 176 ns 54.3 ns 53.6 ns 3.24 1.01 3.28
u16//Std/3333/1 176 ns 53.4 ns 53.8 ns 3.30 0.99 3.27
u16//Std/333/101 27.4 ns 13.0 ns 11.9 ns 2.11 1.09 2.30
u16//Std/123/32 16.5 ns 11.8 ns 10.5 ns 1.40 1.12 1.57
u16//Std/23/7 11.5 ns 4.93 ns 5.14 ns 2.33 0.96 2.24
u16//Std/12/5 11.9 ns 5.15 ns 4.94 ns 2.31 1.04 2.41
u16//Std/3/2 3.33 ns 4.73 ns 4.68 ns 0.70 1.01 0.71
u16//Rng/3333/2242 180 ns 129 ns 104 ns 1.40 1.24 1.73
u16//Rng/3332/1666 185 ns 82.5 ns 84.0 ns 2.24 0.98 2.20
u16//Rng/3333/1111 183 ns 112 ns 87.9 ns 1.63 1.27 2.08
u16//Rng/3333/501 182 ns 167 ns 146 ns 1.09 1.14 1.25
u16//Rng/3333/3300 181 ns 61.2 ns 63.9 ns 2.96 0.96 2.83
u16//Rng/3333/12 167 ns 46.4 ns 47.5 ns 3.60 0.98 3.52
u16//Rng/3333/5 176 ns 53.3 ns 53.4 ns 3.30 1.00 3.30
u16//Rng/3333/1 175 ns 53.6 ns 54.8 ns 3.26 0.98 3.19
u16//Rng/333/101 27.0 ns 13.3 ns 11.8 ns 2.03 1.13 2.29
u16//Rng/123/32 16.5 ns 11.8 ns 10.9 ns 1.40 1.08 1.51
u16//Rng/23/7 11.9 ns 4.92 ns 5.04 ns 2.42 0.98 2.36
u16//Rng/12/5 12.4 ns 5.15 ns 5.35 ns 2.41 0.96 2.32
u16//Rng/3/2 3.34 ns 4.73 ns 4.85 ns 0.71 0.98 0.69
u32//Std/3333/2242 337 ns 258 ns 206 ns 1.31 1.25 1.64
u32//Std/3332/1666 343 ns 169 ns 169 ns 2.03 1.00 2.03
u32//Std/3333/1111 339 ns 206 ns 152 ns 1.65 1.36 2.23
u32//Std/3333/501 336 ns 310 ns 265 ns 1.08 1.17 1.27
u32//Std/3333/3300 340 ns 106 ns 110 ns 3.21 0.96 3.09
u32//Std/3333/12 337 ns 90.5 ns 93.1 ns 3.72 0.97 3.62
u32//Std/3333/5 333 ns 89.7 ns 92.4 ns 3.71 0.97 3.60
u32//Std/3333/1 331 ns 90.8 ns 92.7 ns 3.65 0.98 3.57
u32//Std/333/101 35.3 ns 16.3 ns 16.9 ns 2.17 0.96 2.09
u32//Std/123/32 14.5 ns 12.1 ns 11.2 ns 1.20 1.08 1.29
u32//Std/23/7 11.4 ns 6.89 ns 7.11 ns 1.65 0.97 1.60
u32//Std/12/5 8.91 ns 6.77 ns 7.04 ns 1.32 0.96 1.27
u32//Std/3/2 3.12 ns 4.68 ns 4.76 ns 0.67 0.98 0.66
u32//Rng/3333/2242 331 ns 252 ns 204 ns 1.31 1.24 1.62
u32//Rng/3332/1666 341 ns 164 ns 167 ns 2.08 0.98 2.04
u32//Rng/3333/1111 335 ns 202 ns 148 ns 1.66 1.36 2.26
u32//Rng/3333/501 341 ns 306 ns 266 ns 1.11 1.15 1.28
u32//Rng/3333/3300 336 ns 106 ns 109 ns 3.17 0.97 3.08
u32//Rng/3333/12 332 ns 90.8 ns 96.3 ns 3.66 0.94 3.45
u32//Rng/3333/5 335 ns 88.8 ns 99.1 ns 3.77 0.90 3.38
u32//Rng/3333/1 332 ns 89.3 ns 92.8 ns 3.72 0.96 3.58
u32//Rng/333/101 35.5 ns 16.3 ns 17.1 ns 2.18 0.95 2.08
u32//Rng/123/32 14.5 ns 12.5 ns 10.9 ns 1.16 1.15 1.33
u32//Rng/23/7 11.3 ns 7.03 ns 7.21 ns 1.61 0.98 1.57
u32//Rng/12/5 9.03 ns 7.37 ns 7.19 ns 1.23 1.03 1.26
u32//Rng/3/2 3.08 ns 4.68 ns 4.74 ns 0.66 0.99 0.65
u64//Std/3333/2242 661 ns 436 ns 333 ns 1.52 1.31 1.98
u64//Std/3332/1666 670 ns 325 ns 332 ns 2.06 0.98 2.02
u64//Std/3333/1111 596 ns 392 ns 281 ns 1.52 1.40 2.12
u64//Std/3333/501 659 ns 581 ns 506 ns 1.13 1.15 1.30
u64//Std/3333/3300 668 ns 207 ns 227 ns 3.23 0.91 2.94
u64//Std/3333/12 655 ns 134 ns 134 ns 4.89 1.00 4.89
u64//Std/3333/5 661 ns 175 ns 186 ns 3.78 0.94 3.55
u64//Std/3333/1 661 ns 182 ns 183 ns 3.63 0.99 3.61
u64//Std/333/101 63.2 ns 48.7 ns 39.4 ns 1.30 1.24 1.60
u64//Std/123/32 22.0 ns 13.5 ns 11.9 ns 1.63 1.13 1.85
u64//Std/23/7 11.3 ns 11.2 ns 9.53 ns 1.01 1.18 1.19
u64//Std/12/5 11.9 ns 10.6 ns 9.53 ns 1.12 1.11 1.25
u64//Std/3/2 3.11 ns 4.68 ns 4.78 ns 0.66 0.98 0.65
u64//Rng/3333/2242 659 ns 435 ns 328 ns 1.51 1.33 2.01
u64//Rng/3332/1666 671 ns 325 ns 326 ns 2.06 1.00 2.06
u64//Rng/3333/1111 596 ns 391 ns 286 ns 1.52 1.37 2.08
u64//Rng/3333/501 668 ns 583 ns 506 ns 1.15 1.15 1.32
u64//Rng/3333/3300 665 ns 206 ns 233 ns 3.23 0.88 2.85
u64//Rng/3333/12 668 ns 133 ns 135 ns 5.02 0.99 4.95
u64//Rng/3333/5 661 ns 175 ns 178 ns 3.78 0.98 3.71
u64//Rng/3333/1 659 ns 182 ns 184 ns 3.62 0.99 3.58
u64//Rng/333/101 62.3 ns 48.4 ns 39.8 ns 1.29 1.22 1.57
u64//Rng/123/32 22.2 ns 13.6 ns 12.4 ns 1.63 1.10 1.79
u64//Rng/23/7 11.2 ns 11.4 ns 9.91 ns 0.98 1.15 1.13
u64//Rng/12/5 11.7 ns 10.7 ns 9.97 ns 1.09 1.07 1.17
u64//Rng/3/2 3.04 ns 4.66 ns 4.66 ns 0.65 1.00 0.65
c6//Std/3333/2242 1742 ns 363 ns 290 ns 4.80 1.25 6.01
c6//Std/3332/1666 1733 ns 244 ns 246 ns 7.10 0.99 7.04
c6//Std/3333/1111 1756 ns 323 ns 250 ns 5.44 1.29 7.02
c6//Std/3333/501 1750 ns 477 ns 411 ns 3.67 1.16 4.26
c6//Std/3333/3300 1740 ns 162 ns 164 ns 10.74 0.99 10.61
c6//Std/3333/12 1734 ns 132 ns 133 ns 13.14 0.99 13.04
c6//Std/3333/5 1826 ns 152 ns 155 ns 12.01 0.98 11.78
c6//Std/3333/1 1733 ns 154 ns 154 ns 11.25 1.00 11.25
c6//Std/333/101 180 ns 46.6 ns 38.4 ns 3.86 1.21 4.69
c6//Std/123/32 66.6 ns 13.6 ns 12.1 ns 4.90 1.12 5.50
c6//Std/23/7 12.2 ns 11.0 ns 9.32 ns 1.11 1.18 1.31
c6//Std/12/5 7.16 ns 7.26 ns 7.33 ns 0.99 0.99 0.98
c6//Std/3/2 2.10 ns 5.02 ns 5.10 ns 0.42 0.98 0.41
c6//Rng/3333/2242 1747 ns 363 ns 291 ns 4.81 1.25 6.00
c6//Rng/3332/1666 1736 ns 243 ns 247 ns 7.14 0.98 7.03
c6//Rng/3333/1111 1726 ns 323 ns 247 ns 5.34 1.31 6.99
c6//Rng/3333/501 1746 ns 476 ns 409 ns 3.67 1.16 4.27
c6//Rng/3333/3300 1741 ns 163 ns 164 ns 10.68 0.99 10.62
c6//Rng/3333/12 1728 ns 133 ns 135 ns 12.99 0.99 12.80
c6//Rng/3333/5 1829 ns 155 ns 157 ns 11.80 0.99 11.65
c6//Rng/3333/1 1724 ns 154 ns 157 ns 11.19 0.98 10.98
c6//Rng/333/101 178 ns 46.7 ns 38.5 ns 3.81 1.21 4.62
c6//Rng/123/32 66.0 ns 14.1 ns 12.7 ns 4.68 1.11 5.20
c6//Rng/23/7 12.3 ns 11.4 ns 10.0 ns 1.08 1.14 1.23
c6//Rng/12/5 7.05 ns 7.44 ns 7.78 ns 0.95 0.96 0.91
c6//Rng/3/2 2.10 ns 5.14 ns 5.54 ns 0.41 0.93 0.38
u8//Std/35000/520 785 ns 797 ns 598 ns 0.98 1.33 1.31
u8//Std/35000/3000 725 ns 759 ns 583 ns 0.96 1.30 1.24

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner May 18, 2025 15:50
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 18, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label May 18, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 18, 2025
@StephanTLavavej StephanTLavavej removed their assignment May 22, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 22, 2025
@StephanTLavavej
Member

Thanks! 😻 This was easier to review once I realized it was mostly copied/moved code slightly modified. I pushed trivial changes.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 22, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit d38c194 into microsoft:main May 22, 2025
40 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 22, 2025
@StephanTLavavej
Member

When things spin faster, science happens faster! 🚀 ⏱️ 😹

@AlexGuteniev AlexGuteniev deleted the rotate-big branch May 22, 2025 20:39