-
Notifications
You must be signed in to change notification settings - Fork 13.2k
vulkan: optimize mxfp4 #15363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
vulkan: optimize mxfp4 #15363
Conversation
Modified code (branchless) for e8m0_to_fp32
Are you able to see the generated code for your GPU? I'm surprised if the compiler isn't flattening this. But to see such a big change I guess something must have been going wrong... On NVIDIA, I don't see a perf delta from your change, though it does generate more instructions (I think four per use rather than three). Can you try the following, this should be three instructions (compare, shift, select) on NVIDIA and likely elsewhere too:
|
I've tried similar before. With your suggested modification, I haven't seen any performance improvement on AMD iGPU. I tried also
|
(ux << 23) * (1 - is_zero) the * (1 - is_zero) part seems unecessary. uint result = (uint(x) << 23) | (is_zero << 22); This option you showed looks clearer. Why not go with it? |
This latest version seems a bit slower on NVIDIA, though I think it's maybe due to the original code getting particularly lucky in how it's scheduled. |
@jeffbolznv If I wanted to do this for Nvidia, I'd use NSight Graphics, right? I managed to compile the shaders with debug info and run the "GPU Trace Profiler" with Ampere "Top-Level Triage" and "Real-Time Shader Profiler" to get shader source latency info. This is quite nice, but for some reason I can't get it to show me SASS, it just shows two empty lines (I disabled interleaving here): ![]() Is that expected, am I doing something wrong, and is there a different/better way to look at shader source, device code and performance metrics? |
I think SASS requires Nsight Pro (e.g. see https://forums.developer.nvidia.com/t/nsight-graphics-pro-build/218965) which requires an NDA. If this is something you're interested in, I can try to connect you to the right people. |
Oh, I wasn't aware there was another version of NSight. I don't think I currently need it, but I'll keep it in mind. Some kind of hint in the program that not showing SASS is expected would be good, it looked like a bug to me. |
Huh line 12 on the left side looks like a compiler bug as an 8 bit unsigned load shouldn't require any masking. If we get rid of that unnecessary instruction than that code should be just as fast as the one on the right. |
9070xt PR
current master:
master + pr:
|
@characharm Thanks for your testing. It looks like that the bitweise operations are highly optimized in RDNA3/RDNA4 GPU. |
From my side this is fine, but you have to get rid of the commented out code. Any concerns with this change @jeffbolznv ? |
There's no need to merge or rebase regularly, only if you need to resolve a merge conflict. |
If the change helps consistently across AMD GPUs then I can live with it. But it is about a 1% slowdown in prompt processing for the gpt-oss model on NVIDIA. |
Okay on my 470 with RADV this PR is around 1.5% slower. I checked the assembly and for master the compiler basically does:
That's three instructions. With the PR:
That's five instructions, and RADV isn't as smart as the AMD proprietary driver since it doesn't replace the is_zero shift with a constant. Now even if that was taken care of that's still four instructions as I don't have the combined shift and or instruction that the new RDNA chips have. I don't particularly care about that 1.5% loss in performance but I'm not a fan of slowing others down to work around a driver bug that's throwing in an extra instruction for no reason. BTW @characharm what driver did you run your tests on? |
Modified code (branchless) for e8m0_to_fp32
Running
.\llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m T:\Qwen3-4B-instruct-2507-MXFP4.gguf -m T:\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf -m T:\Qwen3-30B-A3B-Thinking_All_MXFP4.gguf -ngl 99 -v -ot "token_embd.weight=Vulkan0"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon 780M Graphics)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 7 PRO 8845HS w/ Radeon 780M Graphics)
Benchmark Results
Note that for testing purpose the weights are all quantized to mxfp4 in both qwen3 4B MXFP4 and qwen3moe 30B.A3B MXFP4 MoE.