
Conversation

@lovedheart commented Aug 16, 2025

Modified code (branchless) for e8m0_to_fp32
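
For context, a minimal sketch of what this conversion does and how a branchless form can avoid the special-case branch. The `_ref`/`_branchless` names and the use of plain `uint` instead of `uint32_t` are just for this sketch (no extensions needed); the actual variants tried are quoted in the discussion below:

```glsl
// Branchy reference form (quoted verbatim later in this thread). e8m0 stores
// only an 8-bit exponent: the value is 2^(x - 127), built by shifting x into
// the fp32 exponent field. x == 0 is special-cased to 0x00400000, the fp32
// subnormal bit pattern for 2^-127.
float e8m0_to_fp32_ref(uint x) {
    return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
}

// One branchless form discussed in this thread: when x == 0 the shift yields
// 0, and OR-ing in (1 << 22) produces exactly 0x00400000.
float e8m0_to_fp32_branchless(uint x) {
    uint is_zero = uint(x == 0);
    return uintBitsToFloat((x << 23) | (is_zero << 22));
}
```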

Running:

```
.\llama-bench.exe -fa 1 -n 128 -p 0 -r 5 --prio 1 -m T:\Qwen3-4B-instruct-2507-MXFP4.gguf -m T:\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf -m T:\Qwen3-30B-A3B-Thinking_All_MXFP4.gguf -ngl 99 -v -ot "token_embd.weight=Vulkan0"

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (AMD Radeon 780M Graphics)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 7 PRO 8845HS w/ Radeon 780M Graphics)
```

Benchmark Results

**Master (b182)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 21.90 ± 0.04 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 24.38 ± 0.02 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 25.86 ± 0.12 |

**PR (1af9e2e)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.83 ± 0.15 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.44 ± 0.12 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 27.05 ± 0.04 |

**Master (b188)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.20 ± 0.09 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 24.10 ± 0.04 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 25.08 ± 0.31 |

**PR (ef32c83)**

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 23.18 ± 0.09 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.16 ± 0.05 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 26.60 ± 0.36 |

Note that for testing purposes, the weights are all quantized to MXFP4 in both qwen3 4B MXFP4 and qwen3moe 30B.A3B MXFP4 MoE.

_lovedheart requested a review from 0cc4m as a code owner (Aug 16, 2025, 21:22)._
_github-actions bot added the labels **Vulkan** (Issues specific to the Vulkan backend) and **ggml** (changes relating to the ggml tensor library for machine learning)._
@jeffbolznv (Collaborator)

Are you able to see the generated code for your GPU? I'm surprised if the compiler isn't flattening this. But to see such a big change I guess something must have been going wrong...

On NVIDIA, I don't see a perf delta from your change, though it does generate more instructions (I think four per use rather than three). Can you try the following? It should be three instructions (compare, shift, select) on NVIDIA, and likely elsewhere too:

```glsl
float e8m0_to_fp32(uint32_t x) {
    return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
}
```

@lovedheart (Author) commented Aug 16, 2025

> Are you able to see the generated code for your GPU? I'm surprised if the compiler isn't flattening this. But to see such a big change I guess something must have been going wrong...
>
> On NVIDIA, I don't see a perf delta from your change, though it does generate more instructions (I think four per use rather than three). Can you try the following? It should be three instructions (compare, shift, select) on NVIDIA, and likely elsewhere too:
>
> ```glsl
> float e8m0_to_fp32(uint32_t x) {
>     return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
> }
> ```

I've tried something similar before.

With your suggested modification, I haven't seen any performance improvement on the AMD iGPU.

I also tried:

```glsl
uint ux = uint(x);
uint mask = uint(x != 0) - 1u;  // x == 0: 0xFFFFFFFF, x != 0: 0x00000000
uint result = ((ux << 23) & ~mask) | (0x00400000u & mask);
return uintBitsToFloat(result);
```

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.97 ± 0.05 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.35 ± 0.18 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 27.04 ± 0.46 |

and:

```glsl
uint is_zero = uint(x == 0);                      // 1 when x == 0, else 0
uint result = (uint(x) << 23) | (is_zero << 22);  // x == 0 yields exactly 0x00400000
return uintBitsToFloat(result);
```

| Model | Size | Params | Backend | NGL | FA | Test | Throughput (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 22.94 ± 0.07 |
| qwen3 4B MXFP4 MoE | 1.99 GiB | 4.02 B | Vulkan | 99 | 1 | tg128 | 26.33 ± 0.11 |
| qwen3moe 30B.A3B MXFP4 MoE | 15.15 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 27.06 ± 0.09 |
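
As a sanity check (not part of the PR), a throwaway compute shader along these lines could confirm that the OR-based rewrite is bit-exact against the branchy reference over all 256 possible e8m0 inputs; dispatch one workgroup and verify every output is 1:

```glsl
#version 450

layout(local_size_x = 256) in;
layout(binding = 0) writeonly buffer Out { uint ok[]; };

float e8m0_ref(uint x) {
    return x == 0 ? uintBitsToFloat(0x00400000) : uintBitsToFloat(x << 23);
}

float e8m0_branchless(uint x) {
    uint is_zero = uint(x == 0);
    return uintBitsToFloat((x << 23) | (is_zero << 22));
}

void main() {
    uint x = gl_LocalInvocationID.x;  // covers all e8m0 values 0..255
    // Compare raw bits rather than float values, so x == 255 (which encodes
    // +Inf in both forms) is checked exactly.
    ok[x] = uint(floatBitsToUint(e8m0_ref(x)) == floatBitsToUint(e8m0_branchless(x)));
}
```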

@TheAIBot commented Aug 17, 2025

In `(ux << 23) * (1 - is_zero)`, the `* (1 - is_zero)` part seems unnecessary.

`uint result = (uint(x) << 23) | (is_zero << 22);`

This option you showed looks clearer. Why not go with it?

@jeffbolznv (Collaborator)

This latest version seems a bit slower on NVIDIA, though I think that may be due to the original code getting particularly lucky in how it's scheduled.

@0cc4m (Collaborator) commented Aug 17, 2025

It does make a difference in the generated device code for AMD. I used the 780M/gfx1103 target with RGA. Left is old, right is new.

[Screenshot: RGA ISA output for gfx1103, old code vs. new code]

Edit: I compiled this code:

```glsl
#version 450

#extension GL_EXT_shader_explicit_arithmetic_types_int32 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

layout (binding = 0) readonly buffer A {uint8_t data_a[];};
layout (binding = 1) writeonly buffer D {float data_d[];};

float e8m0_to_fp32(uint8_t x) {
    // Swap out implementation here
    uint is_zero = uint(x == 0);
    uint result = (uint(x) << 23) | (is_zero << 22);
    return uintBitsToFloat(result);
}

void main()
{
    uint index = gl_GlobalInvocationID.x;
    data_d[index] = e8m0_to_fp32(data_a[index]);
}
```

I don't see much of a performance difference on RTX 3090, Radeon Pro VII, or A770.

@0cc4m (Collaborator) commented Aug 17, 2025

@jeffbolznv If I wanted to do this for Nvidia, I'd use Nsight Graphics, right? I managed to compile the shaders with debug info and run the "GPU Trace Profiler" with the Ampere "Top-Level Triage" and "Real-Time Shader Profiler" to get shader source latency info. This is quite nice, but for some reason I can't get it to show me SASS; it just shows two empty lines (I disabled interleaving here):

[Screenshot: Nsight shader profiler source view with an empty SASS pane]

Is that expected, am I doing something wrong, and is there a different/better way to look at shader source, device code and performance metrics?

@jeffbolznv (Collaborator)

I think SASS requires Nsight Pro (e.g. see https://forums.developer.nvidia.com/t/nsight-graphics-pro-build/218965) which requires an NDA. If this is something you're interested in, I can try to connect you to the right people.

@0cc4m (Collaborator) commented Aug 17, 2025

Oh, I wasn't aware there was another version of Nsight. I don't think I currently need it, but I'll keep it in mind. Some kind of hint in the program that not showing SASS is expected would be good; it looked like a bug to me.

@netrunnereve (Collaborator)

> It does make a difference in the generated device code for AMD. I used the 780M/gfx1103 target with RGA. Left is old, right is new.

Huh, line 12 on the left side looks like a compiler bug, as an 8-bit unsigned load shouldn't require any masking. If we get rid of that unnecessary instruction, then that code should be just as fast as the one on the right.

@characharm (Contributor) commented Aug 17, 2025

9070 XT, PR:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 152.08 ± 0.66 |

b6178:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | 1 | tg128 | 146.16 ± 2.89 |

current master:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | 1 | tg128 | 153.47 ± 1.40 |

master + PR:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | tg128 | 154.36 ± 0.63 |

@lovedheart (Author)

@characharm Thanks for your testing. It looks like the bitwise operations are highly optimized on RDNA3/RDNA4 GPUs.

@0cc4m (Collaborator) commented Aug 19, 2025

From my side this is fine, but you have to get rid of the commented-out code. Any concerns with this change, @jeffbolznv?

@0cc4m (Collaborator) commented Aug 19, 2025

There's no need to merge or rebase regularly, only if you need to resolve a merge conflict.

@jeffbolznv (Collaborator)

If the change helps consistently across AMD GPUs then I can live with it. But it is about a 1% slowdown in prompt processing for the gpt-oss model on NVIDIA.

@netrunnereve (Collaborator) commented Aug 19, 2025

Okay, on my 470 with RADV this PR is around 1.5% slower. I checked the assembly, and on master the compiler basically does:

```asm
# v25 = x, v46 = 0x00400000
v_lshlrev_b32_e32   v24, 23, v25          # left shift x by 23
v_cmp_eq_i32_e32    vcc, 0, v25           # compare x with 0
v_cndmask_b32_e32   v24, v24, v46, vcc    # select result based on the comparison
```

That's three instructions. With the PR:

```asm
# v25 = x
v_cmp_eq_i32_e32    vcc, 0, v25           # compare x with 0
v_lshlrev_b32_e32   v25, 23, v25          # left shift x by 23
v_cndmask_b32_e64   v32, 0, 1, vcc        # select 0 or 1 depending on the comparison
v_lshlrev_b32_e32   v32, 22, v32          # shift is_zero by 22
v_or_b32_e32        v25, v25, v32         # OR them together
```

That's five instructions, and RADV isn't as smart as the AMD proprietary driver, since it doesn't replace the is_zero shift with a constant. Even if that were taken care of, it would still be four instructions, since I don't have the combined shift-and-OR instruction that the new RDNA chips have.

I don't particularly care about that 1.5% loss in performance, but I'm not a fan of slowing others down to work around a driver bug that's throwing in an extra instruction for no reason.

BTW @characharm, what driver did you run your tests on?
