
Conversation

@CISC (Collaborator) commented Sep 28, 2025

CUDA and Metal support only a very limited column width for argsort; add a check so that we can fall back to the CPU.

Edit: Looks like Vulkan is limited too, but already checks:
https://github.com/ggml-org/llama.cpp/actions/runs/18081472255/job/51445345624?pr=16323#step:3:10561

OpenCL checks:

```cpp
return cols <= max_workgroup_size && op->src[0]->type == GGML_TYPE_F32;
```

@qnixsynapse @NeoZhangJianyu SYCL is probably limited too, but does not check.
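
For reference, a guard along the lines of the OpenCL one could look roughly like this in a SYCL `supports_op` hook. This is only a sketch; the surrounding function and the `device` variable are assumptions, not actual backend code (the SYCL `get_info` query itself is real API):

```cpp
// Sketch of a possible supports_op guard (names are assumptions): reject
// argsort columns wider than the device work-group size so ggml falls back
// to the CPU implementation instead of failing at run time.
case GGML_OP_ARGSORT: {
    const size_t max_wg = device.get_info<sycl::info::device::max_work_group_size>();
    return (size_t) op->src[0]->ne[0] <= max_wg && op->src[0]->type == GGML_TYPE_F32;
}
```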

@CISC requested a review from slaren as a code owner (September 28, 2025 23:43).
@github-actions bot added labels: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) (Sep 28, 2025).
@CISC requested a review from ggerganov as a code owner (September 29, 2025 00:10).
@github-actions bot added the label Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) (Sep 29, 2025).
@CISC changed the title from "cuda : check cuda argsort limits and add test" to "ggml : check cuda and metal argsort limits and add test" (Sep 29, 2025).
@jeffbolznv (Collaborator) commented:

Is the model really using 16k for ne[0]? That's significantly bigger than what I've seen before, and at that size we wouldn't be able to use shared memory on all hardware.

@CISC (Collaborator, Author) commented Sep 29, 2025

> Is the model really using 16k for ne[0]? That's significantly bigger than what I've seen before, and at that size we wouldn't be able to use shared memory on all hardware.

Depends on the input; it will be n_tokens * (n_expert / 8).
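
For a concrete illustration (hypothetical numbers, not from the thread), this is how the column width can reach the ~16k mentioned above:

```cpp
// Hypothetical sizing, for illustration only:
const int64_t n_tokens = 512;
const int64_t n_expert = 256;                   // grouped into 8 groups
const int64_t ne0 = n_tokens * (n_expert / 8);  // 512 * 32 = 16384 argsort columns
```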

@CISC (Collaborator, Author) commented Sep 29, 2025

> Is the model really using 16k for ne[0]? That's significantly bigger than what I've seen before, and at that size we wouldn't be able to use shared memory on all hardware.

> Depends on the input; it will be n_tokens * (n_expert / 8).

I'm thinking I should somehow bypass the first top_k in the group selection (finding the top 2 in each group), as this could get quite slow...

@jeffbolznv (Collaborator) commented:

If you only need the indices of the top 2, then yeah a full sort is overkill. Seems like you'd want a modified argmax...

@CISC (Collaborator, Author) commented Sep 29, 2025

> If you only need the indices of the top 2, then yeah a full sort is overkill. Seems like you'd want a modified argmax...

Hmmm, an argmax2 or even argmaxn could be nice.
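
A minimal sketch of what such an argmax2 could do per row, written in plain C++ for illustration only (this is not existing ggml API): one O(n) pass that keeps the two best values and their indices instead of sorting all ne[0] columns.

```cpp
#include <cstdint>
#include <utility>

// Sketch of a per-row "argmax2": returns the indices of the two largest
// values in row[0..n). Assumes n >= 2. One linear pass instead of a full
// argsort of the whole row.
static std::pair<int64_t, int64_t> argmax2(const float * row, int64_t n) {
    int64_t i0 = 0, i1 = 1;
    if (row[i1] > row[i0]) std::swap(i0, i1);
    for (int64_t i = 2; i < n; ++i) {
        if (row[i] > row[i0]) {
            i1 = i0; i0 = i;  // new best pushes the old best to second place
        } else if (row[i] > row[i1]) {
            i1 = i;           // new second best
        }
    }
    return {i0, i1};
}
```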

@CISC merged commit adc7634 into master (Sep 29, 2025; 64 of 67 checks passed).
@CISC deleted the cisc/check-cuda-argsort-limits branch (September 29, 2025 09:09).
Comment on lines -3642 to +3646
```diff
-        case GGML_OP_ARGSORT:
         case GGML_OP_ACC:
             return true;
+        case GGML_OP_ARGSORT:
+            // TODO: Support arbitrary column width
+            return op->src[0]->ne[0] <= 1024;
```
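
For context, `supports_op` is what lets the scheduler route unsupported ops elsewhere; a simplified sketch of the fallback idea (not the actual scheduler code, and `gpu_backend`, `cpu_backend`, and `node` are placeholder names — `ggml_backend_supports_op` itself is real API):

```cpp
// If the GPU backend rejects the op via supports_op, run it on the CPU backend.
ggml_backend_t backend = ggml_backend_supports_op(gpu_backend, node)
                             ? gpu_backend
                             : cpu_backend;
```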
Collaborator commented:

Preferably keep the order of ggml ops in switch statements consistent with the order in which they are declared in ggml.h.

@CISC (Collaborator, Author) replied:

I didn't reflect much on the order; I just didn't want to unnecessarily break up the fall-throughs. I'll keep it in mind for the future.
