Skip to content

Argsort on int32_t severely underperforms #109

@KungFuJesus

Description

@KungFuJesus

Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limits of an INT32_MAX and all of the "Index" variants also only used 32 bit types. To be clear, a 32 bit index variant still would be beneficial and I see there are plans to eventually implement that. It's just I need a conditional code path for when there are too many values (I have a dataset generated that has length in the billions). I noticed that when the code makes use of the AVX512 instructions for argsort, it severely under performs against the radix sort (which as far as I can tell from the assembly, is completely scalar except for an iota like scan to generate the sequential index).

The benchmarks distributed with this project do show it winning by some margins against the scalar variants in here but not by the typical order of magnitudes that were touted when this project was first introduced. Is this maybe a result of the gather data sampling mitigations in microcode? I know this heavily uses compressed store and gather, and those functions were severely limited by this.

Here's my benchmark on my hardware:

[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 4096 KiB (x6)
  L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t               959 ns          956 ns       731536
simdargsort/random_256/int32_t              2045 ns         2040 ns       343034
simdargsort/random_512/int32_t              3866 ns         3857 ns       181511
simdargsort/random_1k/int32_t               9289 ns         9267 ns        75493
simdargsort/random_5k/int32_t              50900 ns        50768 ns        13789
simdargsort/random_100k/int32_t          1764351 ns      1757047 ns          400
simdargsort/random_1m/int32_t           29256670 ns     29111150 ns           24
simdargsort/random_10m/int32_t         959237861 ns    954095660 ns            1
simdargsort/smallrange_128/int32_t           722 ns          720 ns       975799
simdargsort/smallrange_256/int32_t          2049 ns         2044 ns       342384
simdargsort/smallrange_512/int32_t          3655 ns         3646 ns       191985
simdargsort/smallrange_1k/int32_t           9169 ns         9147 ns        76497
simdargsort/smallrange_5k/int32_t          21375 ns        21320 ns        32747
simdargsort/smallrange_100k/int32_t       467703 ns       465853 ns         1502
simdargsort/smallrange_1m/int32_t        9097255 ns      9047326 ns           75
simdargsort/smallrange_10m/int32_t     183596592 ns    182645622 ns            4
simdargsort/sorted_10k/int32_t            104574 ns       104314 ns         6668
simdargsort/constant_10k/int32_t            8648 ns         8628 ns        81134
simdargsort/reverse_10k/int32_t           105977 ns       105722 ns         6558
scalarargsort/random_128/int32_t             797 ns          795 ns       882026
scalarargsort/random_256/int32_t            1736 ns         1732 ns       403901
scalarargsort/random_512/int32_t            4000 ns         3990 ns       176210
scalarargsort/random_1k/int32_t            21879 ns        21822 ns        31937
scalarargsort/random_5k/int32_t           262249 ns       261537 ns         2676
scalarargsort/random_100k/int32_t        7551632 ns      7525002 ns           93
scalarargsort/random_1m/int32_t        103352847 ns    102895746 ns            7
scalarargsort/random_10m/int32_t      1974523438 ns   1964054017 ns            1
scalarargsort/smallrange_128/int32_t         667 ns          665 ns      1051997
scalarargsort/smallrange_256/int32_t        1498 ns         1494 ns       469065
scalarargsort/smallrange_512/int32_t        3152 ns         3144 ns       223444
scalarargsort/smallrange_1k/int32_t        13730 ns        13697 ns        50943
scalarargsort/smallrange_5k/int32_t       127606 ns       127285 ns         5499
scalarargsort/smallrange_100k/int32_t    3031465 ns      3019854 ns          232
scalarargsort/smallrange_1m/int32_t     38177852 ns     37995219 ns           18
scalarargsort/smallrange_10m/int32_t   695700305 ns    692244521 ns            1
scalarargsort/sorted_10k/int32_t           92423 ns        92202 ns         7578
scalarargsort/constant_10k/int32_t         83139 ns        82931 ns         8439
scalarargsort/reverse_10k/int32_t          74214 ns        74031 ns         9359

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions