
SavicStefan

This PR adds an implementation of ACC_TYPE_VEC2. For the non-coopmat shaders, using ACC_TYPE_VEC2 improves caching behavior, since accessing 32-bit values is generally more efficient than accessing 16-bit values.
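For illustration, here is a minimal GLSL sketch of the idea (this is not the actual ggml Vulkan `mul_mm` shader; the tile sizes `TM`/`TN` and all variable names are made up for the example). The point is that two fp16 accumulators get packed into a single 32-bit `f16vec2` element, so accumulator reads and writes happen in 32-bit units instead of 16-bit ones:

```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Hypothetical tile size, for illustration only.
#define TM 4
#define TN 8

layout(local_size_x = 64) in;

// Scalar fp16 accumulators: every read/write is a 16-bit access.
float16_t sums_scalar[TM * TN];

// ACC_TYPE_VEC2 idea: two fp16 accumulators share one 32-bit element,
// so accumulator traffic happens in 32-bit units.
f16vec2 sums_vec2[(TM * TN) / 2];

void main() {
    float16_t a      = float16_t(1.0);    // placeholder value from matrix A
    f16vec2   b_pair = f16vec2(2.0, 3.0); // placeholder pair of adjacent values from matrix B

    for (int i = 0; i < (TM * TN) / 2; ++i) {
        // One multiply-accumulate updates two neighbouring outputs at once.
        sums_vec2[i] += f16vec2(a) * b_pair;
    }
}
```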

Performance comparison (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti

| Name | Before (us/run) | After (us/run) | Δ% (Improvement) |
| --- | --- | --- | --- |
| MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336) | 5767.64 | 5479.83 | +4.99% |
| MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336) | 5421.40 | 5047.91 | +6.88% |
| MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336) | 5281.02 | 6002.14 | −13.66% |
| MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336) | 2741.43 | 2748.71 | −0.27% |
| MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336) | 2766.60 | 2764.23 | +0.09% |
| MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336) | 2877.49 | 2875.25 | +0.08% |
| MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336) | 2869.17 | 2867.33 | +0.06% |
| MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336) | 2887.17 | 2890.27 | −0.11% |
| MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336) | 4976.57 | 4043.75 | +18.75% |
| MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336) | 4938.25 | 4120.32 | +16.56% |
| MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336) | 5287.85 | 4548.30 | +13.99% |
| MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336) | 5373.34 | 4566.63 | +15.01% |
| MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336) | 5769.13 | 4907.47 | +14.94% |
| MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336) | 5507.98 | 4524.96 | +17.85% |
| MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336) | 4877.02 | 4043.75 | +17.07% |
| MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336) | 5010.98 | 4112.35 | +17.94% |
| MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336) | 4863.99 | 4065.67 | +16.41% |
| MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336) | 4957.83 | 4129.54 | +16.70% |
| MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336) | 4583.30 | 3788.42 | +17.34% |
| MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336) | 5128.29 | 4280.64 | +16.52% |
| MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336) | 4885.91 | 3992.67 | +18.27% |
| MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336) | 4933.56 | 4084.30 | +17.22% |
| MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336) | 5389.60 | 4489.23 | +16.67% |
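(Δ% appears to be computed as (before − after) / before, e.g. for f32: (5767.64 − 5479.83) / 5767.64 ≈ +4.99%; negative values are regressions.)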
Performance before (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   174 runs -  5767.64 us/run -  60.13 GFLOP/run -  10.43 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   186 runs -  5421.40 us/run -  60.13 GFLOP/run -  11.09 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5281.02 us/run -  60.13 GFLOP/run -  11.39 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  366 runs -  2741.43 us/run -  60.13 GFLOP/run -  21.93 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2766.60 us/run -  60.13 GFLOP/run -  21.73 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2877.49 us/run -  60.13 GFLOP/run -  20.90 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2869.17 us/run -  60.13 GFLOP/run -  20.96 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2887.17 us/run -  60.13 GFLOP/run -  20.83 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 202 runs -  4976.57 us/run -  60.13 GFLOP/run -  12.08 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4938.25 us/run -  60.13 GFLOP/run -  12.18 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5287.85 us/run -  60.13 GFLOP/run -  11.37 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  188 runs -  5373.34 us/run -  60.13 GFLOP/run -  11.19 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  174 runs -  5769.13 us/run -  60.13 GFLOP/run -  10.42 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  182 runs -  5507.98 us/run -  60.13 GFLOP/run -  10.92 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               206 runs -  4877.02 us/run -  60.13 GFLOP/run -  12.33 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                200 runs -  5010.98 us/run -  60.13 GFLOP/run -  12.00 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 206 runs -  4863.99 us/run -  60.13 GFLOP/run -  12.36 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               202 runs -  4957.83 us/run -  60.13 GFLOP/run -  12.13 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 220 runs -  4583.30 us/run -  60.13 GFLOP/run -  13.12 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 196 runs -  5128.29 us/run -  60.13 GFLOP/run -  11.73 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                206 runs -  4885.91 us/run -  60.13 GFLOP/run -  12.31 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 204 runs -  4933.56 us/run -  60.13 GFLOP/run -  12.19 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                186 runs -  5389.60 us/run -  60.13 GFLOP/run -  11.16 TFLOPS

Performance after (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   184 runs -  5479.83 us/run -  60.13 GFLOP/run -  10.97 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   200 runs -  5047.91 us/run -  60.13 GFLOP/run -  11.91 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  168 runs -  6002.14 us/run -  60.13 GFLOP/run -  10.02 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  364 runs -  2748.71 us/run -  60.13 GFLOP/run -  21.88 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2764.23 us/run -  60.13 GFLOP/run -  21.75 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2875.25 us/run -  60.13 GFLOP/run -  20.91 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2867.33 us/run -  60.13 GFLOP/run -  20.97 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2890.27 us/run -  60.13 GFLOP/run -  20.80 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  244 runs -  4120.32 us/run -  60.13 GFLOP/run -  14.59 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4548.30 us/run -  60.13 GFLOP/run -  13.22 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4566.63 us/run -  60.13 GFLOP/run -  13.17 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4907.47 us/run -  60.13 GFLOP/run -  12.25 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  222 runs -  4524.96 us/run -  60.13 GFLOP/run -  13.29 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                244 runs -  4112.35 us/run -  60.13 GFLOP/run -  14.62 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4065.67 us/run -  60.13 GFLOP/run -  14.79 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               244 runs -  4129.54 us/run -  60.13 GFLOP/run -  14.56 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 264 runs -  3788.42 us/run -  60.13 GFLOP/run -  15.87 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 234 runs -  4280.64 us/run -  60.13 GFLOP/run -  14.05 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                252 runs -  3992.67 us/run -  60.13 GFLOP/run -  15.06 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4084.30 us/run -  60.13 GFLOP/run -  14.72 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                224 runs -  4489.23 us/run -  60.13 GFLOP/run -  13.39 TFLOPS

SavicStefan requested a review from 0cc4m as a code owner on September 23, 2025.
The github-actions bot added the Vulkan and ggml labels on September 23, 2025.
0cc4m (Collaborator) commented on Sep 27, 2025:

`test-backend-ops -o MUL_MAT_ID` passes on AMD and Intel, but not on Nvidia, so something is not fully correct yet. The only difference I can think of is that Nvidia uses the large shader variant. Does it pass for you?

Here are performance results from my devices. The improvement is very good for Nvidia Ampere (which won't be using this code in practice due to coopmat), but neutral or negative on AMD. I'm not sure why that is.

RTX 3090 without coopmat or integer dot
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1246.28 ± 3.00 | 1489.61 ± 4.57 | +19.5% |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1232.94 ± 1.94 | 1460.53 ± 2.86 | +18.5% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1170.17 ± 4.69 | 1369.08 ± 2.46 | +17.0% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1153.60 ± 4.26 | 1345.56 ± 0.97 | +16.6% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1090.78 ± 3.64 | 1289.07 ± 2.33 | +18.2% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1078.59 ± 1.14 | 1266.90 ± 0.73 | +17.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1097.40 ± 1.35 | 1268.20 ± 1.71 | +15.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1079.72 ± 4.00 | 1245.62 ± 4.37 | +15.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1223.43 ± 3.87 | 1471.25 ± 7.55 | +20.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1202.96 ± 6.81 | 1437.72 ± 6.67 | +19.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1213.63 ± 4.77 | 1439.74 ± 4.71 | +18.6% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1187.41 ± 5.71 | 1411.04 ± 2.48 | +18.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1206.44 ± 4.42 | 1440.83 ± 8.28 | +19.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1190.69 ± 3.29 | 1410.32 ± 7.14 | +18.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 875.14 ± 8.27 | 1082.85 ± 8.25 | +23.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 857.42 ± 7.47 | 1077.03 ± 3.25 | +25.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 1008.49 ± 6.09 | 1453.68 ± 9.48 | +44.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 1013.39 ± 13.40 | 1443.74 ± 5.54 | +42.5% |
AMD Radeon Pro VII without integer dot
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 331.84 ± 1.24 | 331.86 ± 1.16 | +0.0% |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 318.19 ± 0.34 | 316.42 ± 0.62 | -0.6% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 324.79 ± 0.89 | 322.82 ± 0.64 | -0.6% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 311.98 ± 0.42 | 309.54 ± 0.27 | -0.8% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 308.80 ± 0.45 | 304.46 ± 1.18 | -1.4% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 296.04 ± 0.09 | 291.84 ± 0.58 | -1.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 295.78 ± 1.20 | 293.10 ± 1.40 | -0.9% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 284.57 ± 0.66 | 280.86 ± 0.17 | -1.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 342.79 ± 0.35 | 336.89 ± 1.07 | -1.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 327.33 ± 0.26 | 324.10 ± 0.64 | -1.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 344.59 ± 0.38 | 338.37 ± 0.36 | -1.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 328.27 ± 0.70 | 324.29 ± 0.14 | -1.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 335.49 ± 1.22 | 330.65 ± 0.35 | -1.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 320.89 ± 0.58 | 317.12 ± 0.25 | -1.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 390.62 ± 2.24 | 379.93 ± 3.53 | -2.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 361.05 ± 3.19 | 353.02 ± 2.66 | -2.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 536.31 ± 4.08 | 524.24 ± 6.29 | -2.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 528.92 ± 5.79 | 522.94 ± 5.67 | -1.1% |
Intel A770 without integer dot
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 302.05 ± 0.26 | 281.15 ± 0.84 | -6.9% |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 103.91 ± 0.07 | 91.16 ± 0.06 | -12.3% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 233.62 ± 0.21 | 229.47 ± 0.29 | -1.8% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 116.52 ± 0.04 | 97.96 ± 0.09 | -15.9% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 234.15 ± 0.24 | 232.09 ± 0.31 | -0.9% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 107.71 ± 0.10 | 91.82 ± 0.05 | -14.8% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 229.22 ± 0.39 | 227.45 ± 0.42 | -0.8% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 105.69 ± 0.07 | 90.58 ± 0.06 | -14.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 292.04 ± 1.09 | 288.26 ± 0.27 | -1.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 121.09 ± 0.06 | 98.07 ± 0.11 | -19.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 291.05 ± 0.34 | 282.37 ± 0.33 | -3.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 121.47 ± 0.10 | 98.10 ± 0.10 | -19.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 268.73 ± 0.53 | 266.03 ± 0.39 | -1.0% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 118.45 ± 0.08 | 100.82 ± 0.08 | -14.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 299.10 ± 1.24 | 300.25 ± 1.39 | +0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 123.48 ± 0.32 | 109.35 ± 0.34 | -11.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 425.12 ± 1.63 | 426.10 ± 2.06 | +0.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 403.26 ± 4.00 | 401.98 ± 4.18 | -0.3% |
