
Conversation

@lovedheart commented Aug 16, 2025

Modified code (branchless) for e8m0_to_fp32
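
For context, a minimal sketch of what this conversion does and how a branchless form can avoid the special-case branch. The `_ref`/`_branchless` names and the use of plain `uint` instead of `uint32_t` are just for this sketch (no extensions needed); the actual variants tried are quoted in the discussion below:

```glsl
// Branchy reference form (quoted verbatim later in this thread). e8m0 stores
// only an 8-bit exponent: the value is 2^(x - 127), built by shifting x into
// the fp32 exponent field. x == 0 is special-cased to 0x00400000, the fp32
// subnormal bit pattern for 2^-127.
float e8m0_to_fp32_ref(uint x) {
    return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
}

// One branchless form discussed in this thread: when x == 0 the shift yields
// 0, and OR-ing in (1 << 22) produces exactly 0x00400000.
float e8m0_to_fp32_branchless(uint x) {
    uint is_zero = uint(x == 0);
    return uintBitsToFloat((x << 23) | (is_zero << 22));
}
```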

Running:

```
.\llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m T:\Qwen3-4B-instruct-2507-MXFP4.gguf -m T:\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf -m T:\Qwen3-30B-A3B-Thinking_All_MXFP4.gguf -ngl 99 -v -ot "token_embd.weight=Vulkan0"

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon 780M Graphics)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 7 PRO 8845HS w/ Radeon 780M Graphics)
```

Benchmark Results

**Master (b182)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 21.90 ± 0.04 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 24.38 ± 0.02 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 25.86 ± 0.12 |

**PR (1af9e2e)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.83 ± 0.15 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.44 ± 0.12 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 27.05 ± 0.04 |

**Master (b188)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.20 ± 0.09 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 24.10 ± 0.04 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 25.08 ± 0.31 |

**PR (ef32c83)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 23.18 ± 0.09 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.16 ± 0.05 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 26.60 ± 0.36 |

Note that for testing purposes, the weights are all quantized to MXFP4 in both qwen3 4B MXFP4 and qwen3moe 30B.A3B MXFP4 MoE.

_lovedheart requested a review from 0cc4m as a code owner (Aug 16, 2025, 21:22)._
_github-actions bot added the labels **Vulkan** (Issues specific to the Vulkan backend) and **ggml** (changes relating to the ggml tensor library for machine learning)._
@jeffbolznv (Collaborator)

Are you able to see the generated code for your GPU? I'm surprised if the compiler isn't flattening this. But to see such a big change I guess something must have been going wrong...

On NVIDIA, I don't see a perf delta from your change, though it does generate more instructions (I think four per use rather than three). Can you try the following? It should be three instructions (compare, shift, select) on NVIDIA, and likely elsewhere too:

```glsl
float e8m0_to_fp32(uint32_t x) {
    return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
}
```

@lovedheart (Author) commented Aug 16, 2025

> Are you able to see the generated code for your GPU? I'm surprised if the compiler isn't flattening this. But to see such a big change I guess something must have been going wrong...
>
> On NVIDIA, I don't see a perf delta from your change, though it does generate more instructions (I think four per use rather than three). Can you try the following? It should be three instructions (compare, shift, select) on NVIDIA, and likely elsewhere too:
>
> ```glsl
> float e8m0_to_fp32(uint32_t x) {
>     return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
> }
> ```

I've tried something similar before.

With your suggested modification, I haven't seen any performance improvement on the AMD iGPU.

I also tried:

```glsl
uint ux = uint(x);
uint mask = uint(x != 0) - 1u;  // x == 0: 0xFFFFFFFF, x != 0: 0x00000000
uint result = ((ux << 23) & ~mask) | (0x00400000u & mask);
return uintBitsToFloat(result);
```

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.97 ± 0.05 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.35 ± 0.18 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 27.04 ± 0.46 |

and:

```glsl
uint is_zero = uint(x == 0);                      // 1 when x == 0, else 0
uint result = (uint(x) << 23) | (is_zero << 22);  // x == 0 yields exactly 0x00400000
return uintBitsToFloat(result);
```

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.94 ± 0.07 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.33 ± 0.11 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 27.06 ± 0.09 |
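
As a sanity check (not part of the PR), a throwaway compute shader along these lines could confirm that the OR-based rewrite is bit-exact against the branchy reference over all 256 possible e8m0 inputs; dispatch one workgroup and verify every output is 1:

```glsl
#version 450

layout(local_size_x = 256) in;
layout(binding = 0) writeonly buffer Out { uint ok[]; };

float e8m0_ref(uint x) {
    return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
}

float e8m0_branchless(uint x) {
    uint is_zero = uint(x == 0);
    return uintBitsToFloat((x << 23) | (is_zero << 22));
}

void main() {
    uint x = gl_LocalInvocationID.x;  // covers all e8m0 values 0..255
    // Compare raw bits rather than float values, so x == 255 (which encodes
    // +Inf in both forms) is checked exactly.
    ok[x] = uint(floatBitsToUint(e8m0_ref(x)) == floatBitsToUint(e8m0_branchless(x)));
}
```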

@TheAIBot commented Aug 17, 2025

In `(ux << 23) * (1 - is_zero)`, the `* (1 - is_zero)` part seems unnecessary.

`uint result = (uint(x) << 23) | (is_zero << 22);`

This option you showed looks clearer. Why not go with it?

@jeffbolznv (Collaborator)

This latest version seems a bit slower on NVIDIA, though I think that may be due to the original code getting particularly lucky in how it's scheduled.

@0cc4m (Collaborator) commented Aug 17, 2025

It does make a difference in the generated device code for AMD. I used the 780M/gfx1103 target with RGA. Left is old, right is new.

[Screenshot: RGA ISA output for gfx1103, old code vs. new code]

Edit: I compiled this code:

```glsl
#version 450

#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

layout (binding = 0) readonly buffer A {uint8_t data_a[];};
layout (binding = 1) writeonly buffer D {float data_d[];};

float e8m0_to_fp32(uint8_t x) {
    // Swap out implementation here
    uint is_zero = uint(x == 0);
    uint result = (uint(x) << 23) | (is_zero << 22);
    return uintBitsToFloat(result);
}

void main()
{
    uint index = gl_GlobalInvocationID.x;
    data_d[index] = e8m0_to_fp32(data_a[index]);
}
```

I don't see much of a performance difference on RTX 3090, Radeon Pro VII, or A770.

@0cc4m (Collaborator) commented Aug 17, 2025

@jeffbolznv If I wanted to do this for Nvidia, I'd use Nsight Graphics, right? I managed to compile the shaders with debug info and run the "GPU Trace Profiler" with the Ampere "Top-Level Triage" and "Real-Time Shader Profiler" to get shader source latency info. This is quite nice, but for some reason I can't get it to show me SASS; it just shows two empty lines (I disabled interleaving here):

[Screenshot: Nsight shader profiler source view with an empty SASS pane]

Is that expected, am I doing something wrong, and is there a different/better way to look at shader source, device code and performance metrics?

@jeffbolznv (Collaborator)

I think SASS requires Nsight Pro (e.g. see https://forums.developer.nvidia.com/t/nsight-graphics-pro-build/218965) which requires an NDA. If this is something you're interested in, I can try to connect you to the right people.

@0cc4m (Collaborator) commented Aug 17, 2025

Oh, I wasn't aware there was another version of Nsight. I don't think I currently need it, but I'll keep it in mind. Some kind of hint in the program that not showing SASS is expected would be good; it looked like a bug to me.

@netrunnereve (Collaborator)

> It does make a difference in the generated device code for AMD. I used the 780M/gfx1103 target with RGA. Left is old, right is new.

Huh, line 12 on the left side looks like a compiler bug, as an 8-bit unsigned load shouldn't require any masking. If we get rid of that unnecessary instruction, then that code should be just as fast as the one on the right.

@characharm (Contributor) commented Aug 17, 2025

9070 XT, PR:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 152.08 ± 0.66 |

b6178:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | 1 | tg128 | 146.16 ± 2.89 |

current master:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | 1 | tg128 | 153.47 ± 1.40 |

master + PR:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 154.36 ± 0.63 |

@lovedheart (Author)

@characharm Thanks for your testing. It looks like the bitwise operations are highly optimized on RDNA3/RDNA4 GPUs.

@0cc4m (Collaborator) commented Aug 19, 2025

From my side this is fine, but you have to get rid of the commented-out code. Any concerns with this change, @jeffbolznv?

@0cc4m (Collaborator) commented Aug 19, 2025

There's no need to merge or rebase regularly, only if you need to resolve a merge conflict.

@jeffbolznv (Collaborator)

If the change helps consistently across AMD GPUs then I can live with it. But it is about a 1% slowdown in prompt processing for the gpt-oss model on NVIDIA.

@netrunnereve (Collaborator) commented Aug 19, 2025

Okay, on my 470 with RADV this PR is around 1.5% slower. I checked the assembly, and on master the compiler basically does:

```asm
# v25 = x, v46 = 0x00400000
v_lshlrev_b32_e32   v24, 23, v25          # left shift x by 23
v_cmp_eq_i32_e32    vcc, 0, v25           # compare x with 0
v_cndmask_b32_e32   v24, v24, v46, vcc    # select result based on the comparison
```

That's three instructions. With the PR:

```asm
# v25 = x
v_cmp_eq_i32_e32    vcc, 0, v25           # compare x with 0
v_lshlrev_b32_e32   v25, 23, v25          # left shift x by 23
v_cndmask_b32_e64   v32, 0, 1, vcc        # select 0 or 1 depending on the comparison
v_lshlrev_b32_e32   v32, 22, v32          # shift is_zero by 22
v_or_b32_e32        v25, v25, v32         # OR them together
```

That's five instructions, and RADV isn't as smart as the AMD proprietary driver, since it doesn't replace the is_zero shift with a constant. Even if that were taken care of, it would still be four instructions, since I don't have the combined shift-and-OR instruction that the new RDNA chips have.

I don't particularly care about that 1.5% loss in performance, but I'm not a fan of slowing others down to work around a driver bug that's throwing in an extra instruction for no reason.

BTW @characharm, what driver did you run your tests on?
