
SavicStefan

This PR adds an implementation of ACC_TYPE_VEC2. For the non-coopmat shaders, using ACC_TYPE_VEC2 improves caching behavior, since accessing 32-bit values is generally more efficient than accessing 16-bit values.
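For illustration, here is a minimal GLSL sketch of the idea (this is not the actual ggml Vulkan `mul_mm` shader; the tile sizes `TM`/`TN` and all variable names are made up for the example). The point is that two fp16 accumulators get packed into a single 32-bit `f16vec2` element, so accumulator reads and writes happen in 32-bit units instead of 16-bit ones:

```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Hypothetical tile size, for illustration only.
#define TM 4
#define TN 8

layout(local_size_x = 64) in;

// Scalar fp16 accumulators: every read/write is a 16-bit access.
float16_t sums_scalar[TM * TN];

// ACC_TYPE_VEC2 idea: two fp16 accumulators share one 32-bit element,
// so accumulator traffic happens in 32-bit units.
f16vec2 sums_vec2[(TM * TN) / 2];

void main() {
    float16_t a      = float16_t(1.0);    // placeholder value from matrix A
    f16vec2   b_pair = f16vec2(2.0, 3.0); // placeholder pair of adjacent values from matrix B

    for (int i = 0; i < (TM * TN) / 2; ++i) {
        // One multiply-accumulate updates two neighbouring outputs at once.
        sums_vec2[i] += f16vec2(a) * b_pair;
    }
}
```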

Performance comparison (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti

| Name | Before (us/run) | After (us/run) | Δ% (Improvement) |
| --- | --- | --- | --- |
| MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336) | 5767.64 | 5479.83 | +4.99% |
| MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336) | 5421.40 | 5047.91 | +6.88% |
| MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336) | 5281.02 | 6002.14 | −13.66% |
| MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336) | 2741.43 | 2748.71 | −0.27% |
| MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336) | 2766.60 | 2764.23 | +0.09% |
| MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336) | 2877.49 | 2875.25 | +0.08% |
| MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336) | 2869.17 | 2867.33 | +0.06% |
| MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336) | 2887.17 | 2890.27 | −0.11% |
| MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336) | 4976.57 | 4043.75 | +18.75% |
| MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336) | 4938.25 | 4120.32 | +16.56% |
| MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336) | 5287.85 | 4548.30 | +13.99% |
| MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336) | 5373.34 | 4566.63 | +15.01% |
| MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336) | 5769.13 | 4907.47 | +14.94% |
| MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336) | 5507.98 | 4524.96 | +17.85% |
| MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336) | 4877.02 | 4043.75 | +17.07% |
| MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336) | 5010.98 | 4112.35 | +17.94% |
| MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336) | 4863.99 | 4065.67 | +16.41% |
| MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336) | 4957.83 | 4129.54 | +16.70% |
| MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336) | 4583.30 | 3788.42 | +17.34% |
| MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336) | 5128.29 | 4280.64 | +16.52% |
| MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336) | 4885.91 | 3992.67 | +18.27% |
| MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336) | 4933.56 | 4084.30 | +17.22% |
| MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336) | 5389.60 | 4489.23 | +16.67% |
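(Δ% appears to be computed as (before − after) / before, e.g. for f32: (5767.64 − 5479.83) / 5767.64 ≈ +4.99%; negative values are regressions.)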
Performance before (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   174 runs -  5767.64 us/run -  60.13 GFLOP/run -  10.43 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   186 runs -  5421.40 us/run -  60.13 GFLOP/run -  11.09 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5281.02 us/run -  60.13 GFLOP/run -  11.39 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  366 runs -  2741.43 us/run -  60.13 GFLOP/run -  21.93 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2766.60 us/run -  60.13 GFLOP/run -  21.73 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2877.49 us/run -  60.13 GFLOP/run -  20.90 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2869.17 us/run -  60.13 GFLOP/run -  20.96 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2887.17 us/run -  60.13 GFLOP/run -  20.83 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 202 runs -  4976.57 us/run -  60.13 GFLOP/run -  12.08 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4938.25 us/run -  60.13 GFLOP/run -  12.18 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5287.85 us/run -  60.13 GFLOP/run -  11.37 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  188 runs -  5373.34 us/run -  60.13 GFLOP/run -  11.19 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  174 runs -  5769.13 us/run -  60.13 GFLOP/run -  10.42 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  182 runs -  5507.98 us/run -  60.13 GFLOP/run -  10.92 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               206 runs -  4877.02 us/run -  60.13 GFLOP/run -  12.33 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                200 runs -  5010.98 us/run -  60.13 GFLOP/run -  12.00 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 206 runs -  4863.99 us/run -  60.13 GFLOP/run -  12.36 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               202 runs -  4957.83 us/run -  60.13 GFLOP/run -  12.13 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 220 runs -  4583.30 us/run -  60.13 GFLOP/run -  13.12 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 196 runs -  5128.29 us/run -  60.13 GFLOP/run -  11.73 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                206 runs -  4885.91 us/run -  60.13 GFLOP/run -  12.31 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 204 runs -  4933.56 us/run -  60.13 GFLOP/run -  12.19 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                186 runs -  5389.60 us/run -  60.13 GFLOP/run -  11.16 TFLOPS

Performance after (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   184 runs -  5479.83 us/run -  60.13 GFLOP/run -  10.97 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   200 runs -  5047.91 us/run -  60.13 GFLOP/run -  11.91 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  168 runs -  6002.14 us/run -  60.13 GFLOP/run -  10.02 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  364 runs -  2748.71 us/run -  60.13 GFLOP/run -  21.88 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2764.23 us/run -  60.13 GFLOP/run -  21.75 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2875.25 us/run -  60.13 GFLOP/run -  20.91 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2867.33 us/run -  60.13 GFLOP/run -  20.97 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2890.27 us/run -  60.13 GFLOP/run -  20.80 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  244 runs -  4120.32 us/run -  60.13 GFLOP/run -  14.59 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4548.30 us/run -  60.13 GFLOP/run -  13.22 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4566.63 us/run -  60.13 GFLOP/run -  13.17 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4907.47 us/run -  60.13 GFLOP/run -  12.25 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  222 runs -  4524.96 us/run -  60.13 GFLOP/run -  13.29 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                244 runs -  4112.35 us/run -  60.13 GFLOP/run -  14.62 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4065.67 us/run -  60.13 GFLOP/run -  14.79 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               244 runs -  4129.54 us/run -  60.13 GFLOP/run -  14.56 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 264 runs -  3788.42 us/run -  60.13 GFLOP/run -  15.87 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 234 runs -  4280.64 us/run -  60.13 GFLOP/run -  14.05 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                252 runs -  3992.67 us/run -  60.13 GFLOP/run -  15.06 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4084.30 us/run -  60.13 GFLOP/run -  14.72 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                224 runs -  4489.23 us/run -  60.13 GFLOP/run -  13.39 TFLOPS

SavicStefan requested a review from 0cc4m as a code owner on September 23, 2025.
The github-actions bot added the Vulkan and ggml labels on September 23, 2025.
0cc4m (Collaborator) commented on Sep 27, 2025:

`test-backend-ops -o MUL_MAT_ID` passes on AMD and Intel, but not on Nvidia, so something is not fully correct yet. The only difference I can think of is that Nvidia uses the large shader variant. Does it pass for you?

Here are performance results from my devices. The improvement is very good for Nvidia Ampere (which won't be using this code in practice due to coopmat), but neutral or negative on AMD. I'm not sure why that is.

RTX 3090 without coopmat or integer dot
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1246.28 ± 3.00 | 1489.61 ± 4.57 | +19.5% |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1232.94 ± 1.94 | 1460.53 ± 2.86 | +18.5% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1170.17 ± 4.69 | 1369.08 ± 2.46 | +17.0% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1153.60 ± 4.26 | 1345.56 ± 0.97 | +16.6% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1090.78 ± 3.64 | 1289.07 ± 2.33 | +18.2% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1078.59 ± 1.14 | 1266.90 ± 0.73 | +17.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1097.40 ± 1.35 | 1268.20 ± 1.71 | +15.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1079.72 ± 4.00 | 1245.62 ± 4.37 | +15.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1223.43 ± 3.87 | 1471.25 ± 7.55 | +20.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1202.96 ± 6.81 | 1437.72 ± 6.67 | +19.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1213.63 ± 4.77 | 1439.74 ± 4.71 | +18.6% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1187.41 ± 5.71 | 1411.04 ± 2.48 | +18.8% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 1206.44 ± 4.42 | 1440.83 ± 8.28 | +19.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 1190.69 ± 3.29 | 1410.32 ± 7.14 | +18.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 875.14 ± 8.27 | 1082.85 ± 8.25 | +23.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 857.42 ± 7.47 | 1077.03 ± 3.25 | +25.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 1008.49 ± 6.09 | 1453.68 ± 9.48 | +44.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 1013.39 ± 13.40 | 1443.74 ± 5.54 | +42.5% |
AMD Radeon Pro VII without integer dot
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 331.84 ± 1.24 | 331.86 ± 1.16 | +0.0% |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 318.19 ± 0.34 | 316.42 ± 0.62 | -0.6% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 324.79 ± 0.89 | 322.82 ± 0.64 | -0.6% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 311.98 ± 0.42 | 309.54 ± 0.27 | -0.8% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 308.80 ± 0.45 | 304.46 ± 1.18 | -1.4% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 296.04 ± 0.09 | 291.84 ± 0.58 | -1.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 295.78 ± 1.20 | 293.10 ± 1.40 | -0.9% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 284.57 ± 0.66 | 280.86 ± 0.17 | -1.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 342.79 ± 0.35 | 336.89 ± 1.07 | -1.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 327.33 ± 0.26 | 324.10 ± 0.64 | -1.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 344.59 ± 0.38 | 338.37 ± 0.36 | -1.8% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 328.27 ± 0.70 | 324.29 ± 0.14 | -1.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 335.49 ± 1.22 | 330.65 ± 0.35 | -1.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 320.89 ± 0.58 | 317.12 ± 0.25 | -1.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 390.62 ± 2.24 | 379.93 ± 3.53 | -2.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 361.05 ± 3.19 | 353.02 ± 2.66 | -2.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 536.31 ± 4.08 | 524.24 ± 6.29 | -2.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 528.92 ± 5.79 | 522.94 ± 5.67 | -1.1% |
Intel A770 without integer dot
| model | size | params | backend | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 302.05 ± 0.26 | 281.15 ± 0.84 | -6.9% |
| llama 8B IQ1_S - 1.5625 bpw | 1.87 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 103.91 ± 0.07 | 91.16 ± 0.06 | -12.3% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 233.62 ± 0.21 | 229.47 ± 0.29 | -1.8% |
| llama 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 116.52 ± 0.04 | 97.96 ± 0.09 | -15.9% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 234.15 ± 0.24 | 232.09 ± 0.31 | -0.9% |
| llama 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 107.71 ± 0.10 | 91.82 ± 0.05 | -14.8% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 229.22 ± 0.39 | 227.45 ± 0.42 | -0.8% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 105.69 ± 0.07 | 90.58 ± 0.06 | -14.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 292.04 ± 1.09 | 288.26 ± 0.27 | -1.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 121.09 ± 0.06 | 98.07 ± 0.11 | -19.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 291.05 ± 0.34 | 282.37 ± 0.33 | -3.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 121.47 ± 0.10 | 98.10 ± 0.10 | -19.2% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 0 | pp512 | 268.73 ± 0.53 | 266.03 ± 0.39 | -1.0% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 99 | 1 | pp512 | 118.45 ± 0.08 | 100.82 ± 0.08 | -14.9% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 299.10 ± 1.24 | 300.25 ± 1.39 | +0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 123.48 ± 0.32 | 109.35 ± 0.34 | -11.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 0 | pp512 | 425.12 ± 1.63 | 426.10 ± 2.06 | +0.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | Vulkan | 99 | 1 | pp512 | 403.26 ± 4.00 | 401.98 ± 4.18 | -0.3% |
