
Conversation

@r-devulap (Member) commented May 15, 2025

This pull request refactors the logic for determining the number of threads used in parallel sorting functions by introducing a new helper function, xss_get_num_threads. This change ensures consistency and simplifies thread-count management across multiple sorting implementations.
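For illustration, the call-site change presumably has the following shape. This is a minimal sketch, not the exact diff: the pragma placement, the example function, and the clamp to a minimum of one thread are assumptions.

    #include <omp.h>

    static int xss_get_num_threads()
    {
        // Heuristic introduced by this PR: treat half the logical processors
        // as the physical-core count (assumes hyperthreading on every core).
        // Clamping to a minimum of 1 is this sketch's addition.
        int num_physical_cores = omp_get_num_procs() / 2;
        return num_physical_cores > 0 ? num_physical_cores : 1;
    }

    // Illustrative call site: a fixed cap such as num_threads(8) gives way
    // to the machine-derived value.
    void parallel_sort_sketch(float *arr, int n)
    {
    #pragma omp parallel num_threads(xss_get_num_threads())
        {
            // ... partition the range and sort sub-ranges in parallel ...
        }
        (void)arr;
        (void)n;
    }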

@r-devulap requested a review from Copilot May 15, 2025 16:53
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces a new helper function, xss_get_num_threads, to determine the optimal number of threads based on the number of physical CPU cores, replacing hardcoded thread limits.

  • Replaced hardcoded thread limits with xss_get_num_threads in multiple qsort and argsort functions.
  • Added a new helper function in src/xss-common-includes.h to calculate the thread count.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File                              Description
src/xss-common-qsort.h            Removed hardcoded thread limit; now calls xss_get_num_threads().
src/xss-common-keyvaluesort.hpp   Removed hardcoded thread limit; now calls xss_get_num_threads().
src/xss-common-argsort.h          Removed hardcoded thread limit; now calls xss_get_num_threads().
src/avx512-16bit-qsort.hpp        Removed hardcoded thread limit; now calls xss_get_num_threads().
src/xss-common-includes.h         Introduced xss_get_num_threads to calculate the thread count based on CPU cores.

From src/xss-common-includes.h:

    {
        // Get the number of physical cores: works only when hyperthreading is
        // enabled on all cores
        int num_physical_cores = omp_get_num_procs() / 2;
Copilot AI commented May 15, 2025

The comment indicates that this logic assumes hyperthreading is enabled on all cores. Consider expanding the documentation to clarify that, on systems without hyperthreading, this calculation may underutilize the available cores, or adjust the calculation for more robust behavior.
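For illustration, a more defensive variant along the lines of this suggestion might look like the sketch below. The name xss_get_num_threads_robust and the fallback heuristics are assumptions for this sketch, not the code merged in this PR:

    #include <omp.h>
    #include <algorithm>

    static int xss_get_num_threads_robust()
    {
        // omp_get_num_procs() reports logical processors. Halving it
        // recovers the physical-core count only when every core runs two
        // hardware threads; on non-SMT systems it leaves half the cores idle.
        int logical_procs = omp_get_num_procs();

        // If the user constrained the runtime (e.g. via OMP_NUM_THREADS),
        // omp_get_max_threads() reflects that limit; honor it directly.
        int max_threads = omp_get_max_threads();
        if (max_threads < logical_procs) {
            return max_threads;
        }

        // Otherwise fall back to the halving heuristic, never below one.
        return std::max(1, logical_procs / 2);
    }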


@sterrettm2 (Contributor) left a comment

This patch seems to cause a major regression on my SPR (Sapphire Rapids) system at the 1m and 10m sizes, despite using far more CPU. 100m benefits a bit and, interestingly, has lower CPU time as well:

1m
Comparing simdsort/random_1m (from /home/sterrettm/xss/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_1m (from /home/sterrettm/xss/x86-simd-sort/.bench/raghu_omp/builddir/benchexe)
Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_1m vs. simdsort/random_1m]/uint64_t                +0.6633         +0.6638       1796738       2988555       1796346       2988695
[simdsort/random_1m vs. simdsort/random_1m]/int64_t                 +0.6111         +0.6112       1813152       2921096       1812886       2920974
[simdsort/random_1m vs. simdsort/random_1m]/uint32_t                +1.4261         +1.4277        774474       1878973        774148       1879380
[simdsort/random_1m vs. simdsort/random_1m]/int32_t                 +1.3888         +1.3907        789581       1886162        789146       1886582
[simdsort/random_1m vs. simdsort/random_1m]/uint16_t                +1.4335         +1.4344        659436       1604750        659213       1604815
[simdsort/random_1m vs. simdsort/random_1m]/int16_t                 +1.4339         +1.4343        661424       1609820        661205       1609553
[simdsort/random_1m vs. simdsort/random_1m]/float                   +1.4058         +1.4067        785036       1888669        784693       1888504
[simdsort/random_1m vs. simdsort/random_1m]/double                  +0.7759         +0.7764       1398137       2482994       1397812       2483070
[simdsort/random_1m vs. simdsort/random_1m]/_Float16                +0.9243         +0.9252       1084081       2086116       1083642       2086215
OVERALL_GEOMEAN                                                     +1.0886         +1.0894             0             0             0             0

10m
Comparing simdsort/random_10m (from /home/sterrettm/xss/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_10m (from /home/sterrettm/xss/x86-simd-sort/.bench/raghu_omp/builddir/benchexe)
Benchmark                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_10m vs. simdsort/random_10m]/uint64_t                -0.0393         -0.1954      25032500      24047678      25031503      20141074
[simdsort/random_10m vs. simdsort/random_10m]/int64_t                 -0.0184         -0.1239      23815509      23376537      23802992      20853016
[simdsort/random_10m vs. simdsort/random_10m]/uint32_t                +0.5229         +0.5236       7872549      11989459       7869377      11989398
[simdsort/random_10m vs. simdsort/random_10m]/int32_t                 +0.5457         +0.5460       7929956      12257078       7927239      12255340
[simdsort/random_10m vs. simdsort/random_10m]/uint16_t                +0.3530         +0.3530       5700344       7712651       5700039       7712366
[simdsort/random_10m vs. simdsort/random_10m]/int16_t                 +0.3061         +0.3060       5710822       7459049       5710548       7457834
[simdsort/random_10m vs. simdsort/random_10m]/float                   +0.5283         +0.5281       7984395      12202217       7984079      12200205
[simdsort/random_10m vs. simdsort/random_10m]/double                  -0.0092         -0.1719      22430352      22224402      22429597      18573556
[simdsort/random_10m vs. simdsort/random_10m]/_Float16                +0.6462         +0.6463       4400445       7243948       4399974       7243662
OVERALL_GEOMEAN                                                       +0.2883         +0.2228             0             0             0             0

100m
Comparing simdsort/random_100m (from /home/sterrettm/xss/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_100m (from /home/sterrettm/xss/x86-simd-sort/.bench/raghu_omp/builddir/benchexe)
Benchmark                                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_100m vs. simdsort/random_100m]/uint64_t                -0.0767         -0.8039     580935369     536395893     578418465     113435125
[simdsort/random_100m vs. simdsort/random_100m]/int64_t                 -0.0477         -0.6396     586234077     558282477     586229954     211306420
[simdsort/random_100m vs. simdsort/random_100m]/uint32_t                -0.0818         -0.4046     232056239     213083404     231701433     137960312
[simdsort/random_100m vs. simdsort/random_100m]/int32_t                 -0.0364         -0.5359     232368921     223917915     232213279     107775511
[simdsort/random_100m vs. simdsort/random_100m]/uint16_t                -0.0501         -0.4083      80935109      76880782      80918052      47882478
[simdsort/random_100m vs. simdsort/random_100m]/int16_t                 -0.0658         -0.3608      81683614      76308548      81642565      52188516
[simdsort/random_100m vs. simdsort/random_100m]/float                   -0.0288         -0.3741     237212078     230381039     236491140     148012758
[simdsort/random_100m vs. simdsort/random_100m]/double                  +0.0422         -0.4587     573784986     598021062     569874959     308481275
[simdsort/random_100m vs. simdsort/random_100m]/_Float16                -0.0881         -0.2259      81381343      74207809      81373671      62989080
OVERALL_GEOMEAN                                                         -0.0488         -0.5004             0             0             0             0

I am very interested in why the change in CPU time is negative for some of these tests. With so many threads being spawned, maybe OpenMP is no longer having them spin? It seems that in those cases performance improves and CPU utilization drops. If we could force that behavior all the time, maybe setting the thread limit this high could work for smaller sizes too.
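For what it's worth, idle-worker spinning is governed by the OpenMP wait policy, so one quick way to test this hypothesis would be to rerun the comparison with the standard environment variable set (hypothetical invocation, assuming the benchmark binary paths above):

    # 'passive' makes idle OpenMP threads block in the OS instead of
    # busy-spinning between parallel regions; 'active' forces spinning.
    OMP_WAIT_POLICY=passive ./builddir/benchexe --benchmark_filter=random_1m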

@sterrettm2 (Contributor) left a comment

Looks good to me! About 11% improvement for 100m and about 21% for 10m; seems like a good idea! I'll merge this once the CI passes.

@sterrettm2 merged commit 1c6c015 into numpy:main on May 16, 2025 (13 checks passed).