@0cc4m (Collaborator) commented Sep 17, 2025

This PR changes the shared memory and register caching to use vec2 instead of scalars. Initially this was to enable vec2 dot instructions for the accumulations, but I think it also helps with caching because accessing 32-bit values is more efficient than accessing 16-bit values.

It needs a few more registers because it loads 2 k-values from shared memory into registers instead of just 1.
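
To illustrate the idea, here is a minimal standalone sketch of the accumulation pattern this enables (hypothetical bindings, buffer names and push constant, not the actual mul_mm.comp code): the caches hold f16vec2 values, so each loop step consumes two k-values per operand and the accumulation can map to a two-wide dot / packed-fma instruction.

#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout (local_size_x = 64) in;

// one row of A and a column of B, both stored as f16vec2 pairs along k
layout (binding = 0) readonly buffer A {f16vec2 data_a[];};
layout (binding = 1) readonly buffer B {f16vec2 data_b[];};
layout (binding = 2) writeonly buffer D {float data_d[];};

layout (push_constant) uniform Params { uint k2; } p; // number of f16vec2 pairs along k

void main()
{
    const uint row = gl_GlobalInvocationID.x;

    float sum = 0.0;
    for (uint i = 0; i < p.k2; i++) {
        // two k-values per operand, accumulated with a single two-wide dot
        sum += dot(vec2(data_a[row * p.k2 + i]), vec2(data_b[i]));
    }
    data_d[row] = sum;
}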

@0cc4m requested a review from jeffbolznv September 17, 2025 16:34
@github-actions bot added the Vulkan and ggml labels Sep 17, 2025
@0cc4m (Collaborator, Author) commented Sep 17, 2025

Nvidia RTX 3090

Coopmat1:

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 3698.65 ± 20.83 3859.05 ± 22.75 +4.3%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 tg128 78.84 ± 0.35 78.35 ± 0.33 -0.6%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 3689.59 ± 19.70 3838.88 ± 6.68 +4.0%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 tg128 79.08 ± 0.23 77.80 ± 0.16 -1.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 2983.33 ± 17.28 3202.46 ± 3.99 +7.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 tg128 98.77 ± 0.38 96.82 ± 0.44 -2.0%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 2975.39 ± 8.25 3199.13 ± 5.71 +7.5%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 tg128 98.06 ± 0.25 95.85 ± 0.26 -2.3%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 2757.56 ± 5.44 2855.59 ± 29.35 +3.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 tg128 85.73 ± 0.31 84.27 ± 0.17 -1.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 2740.31 ± 8.39 2806.00 ± 25.21 +2.4%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 tg128 85.17 ± 0.27 83.56 ± 0.33 -1.9%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 2602.50 ± 22.48 2601.89 ± 24.45 -0.0%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 122.45 ± 0.48 122.74 ± 0.58 +0.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 2579.82 ± 19.50 2494.23 ± 33.51 -3.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 118.14 ± 0.46 117.08 ± 0.18 -0.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 3771.03 ± 9.28 3852.97 ± 29.60 +2.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 143.73 ± 0.64 145.19 ± 0.67 +1.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 3740.55 ± 8.87 3798.49 ± 30.75 +1.5%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 139.63 ± 0.23 142.92 ± 0.47 +2.4%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 3600.01 ± 19.22 3599.48 ± 8.28 -0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 136.24 ± 0.34 137.31 ± 0.29 +0.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 3540.87 ± 7.42 3565.16 ± 18.22 +0.7%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 131.89 ± 0.27 133.46 ± 1.74 +1.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 3141.03 ± 88.98 3147.74 ± 8.73 +0.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 91.56 ± 0.22 91.78 ± 0.24 +0.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 3155.82 ± 4.68 3128.18 ± 28.61 -0.9%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 91.04 ± 0.22 91.17 ± 0.11 +0.1%

fp16 only, no coopmat or integer dot:

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 1119.67 ± 2.49 752.08 ± 1.08 -32.8%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 tg128 79.28 ± 0.25 78.06 ± 0.18 -1.5%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 1100.23 ± 1.31 726.70 ± 9.23 -34.0%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 tg128 77.58 ± 0.16 76.80 ± 0.10 -1.0%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 1044.76 ± 2.95 709.27 ± 9.92 -32.1%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 tg128 98.11 ± 0.10 97.28 ± 0.15 -0.8%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 1018.82 ± 7.00 684.62 ± 1.07 -32.8%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 tg128 95.57 ± 0.21 95.07 ± 0.21 -0.5%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 959.25 ± 11.63 658.00 ± 10.11 -31.4%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 tg128 84.40 ± 0.27 83.88 ± 0.27 -0.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 911.18 ± 3.10 638.61 ± 5.44 -29.9%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 tg128 83.23 ± 0.05 82.69 ± 0.29 -0.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 932.01 ± 16.49 623.06 ± 14.78 -33.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 121.33 ± 0.36 120.42 ± 0.64 -0.8%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 900.32 ± 9.58 619.17 ± 4.98 -31.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 114.76 ± 0.69 113.73 ± 0.10 -0.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1076.98 ± 31.73 706.16 ± 9.69 -34.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 136.72 ± 0.38 134.50 ± 1.09 -1.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1016.89 ± 37.52 689.40 ± 3.62 -32.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 130.99 ± 0.25 131.13 ± 0.11 +0.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1052.20 ± 25.71 701.55 ± 10.18 -33.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 126.23 ± 0.45 126.20 ± 0.34 -0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1002.47 ± 6.13 681.43 ± 6.17 -32.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 122.55 ± 0.22 122.88 ± 0.27 +0.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1044.00 ± 20.93 705.14 ± 10.96 -32.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 91.51 ± 0.16 91.40 ± 0.16 -0.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 978.13 ± 10.50 679.43 ± 5.04 -30.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 89.77 ± 0.10 90.40 ± 0.61 +0.7%
AMD Radeon Pro VII
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 329.39 ± 0.45 332.38 ± 0.22 +0.9%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 tg128 83.10 ± 0.39 82.44 ± 0.16 -0.8%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 313.21 ± 0.25 316.30 ± 0.33 +1.0%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 tg128 78.20 ± 0.50 77.65 ± 0.17 -0.7%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 313.72 ± 0.26 324.10 ± 0.32 +3.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 tg128 70.35 ± 0.08 70.40 ± 0.12 +0.1%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 301.52 ± 0.17 310.86 ± 0.18 +3.1%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 tg128 67.19 ± 0.08 66.92 ± 0.24 -0.4%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 301.87 ± 0.17 305.47 ± 0.15 +1.2%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 tg128 52.41 ± 0.19 51.68 ± 0.02 -1.4%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 289.62 ± 0.25 291.86 ± 0.77 +0.8%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 tg128 50.57 ± 0.10 49.91 ± 0.16 -1.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 291.44 ± 0.30 293.55 ± 0.40 +0.7%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 71.63 ± 1.49 72.07 ± 0.21 +0.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 279.90 ± 0.09 281.62 ± 0.36 +0.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 69.85 ± 0.16 68.78 ± 0.44 -1.5%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 738.77 ± 0.35 723.68 ± 5.41 -2.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 98.54 ± 0.46 93.41 ± 1.09 -5.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 667.78 ± 0.57 657.61 ± 2.00 -1.5%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 91.73 ± 0.63 86.98 ± 0.33 -5.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 644.67 ± 0.27 641.24 ± 0.95 -0.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 102.06 ± 0.77 96.76 ± 0.26 -5.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 590.17 ± 0.46 581.79 ± 0.55 -1.4%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 93.57 ± 0.49 89.83 ± 0.36 -4.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 647.84 ± 0.44 639.25 ± 4.52 -1.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 73.12 ± 0.14 68.81 ± 0.07 -5.9%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 591.01 ± 0.49 584.45 ± 0.11 -1.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 66.84 ± 1.17 65.50 ± 0.03 -2.0%

without integer dot:

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 326.17 ± 0.69 331.83 ± 0.47 +1.7%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 tg128 82.59 ± 0.19 82.84 ± 0.17 +0.3%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 309.99 ± 0.15 314.68 ± 0.53 +1.5%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 tg128 77.47 ± 0.12 77.83 ± 0.29 +0.5%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 310.69 ± 0.17 322.13 ± 0.23 +3.7%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 tg128 70.41 ± 0.12 70.65 ± 0.06 +0.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 300.37 ± 0.17 311.01 ± 0.20 +3.5%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 tg128 67.12 ± 0.08 67.38 ± 0.04 +0.4%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 299.74 ± 1.32 305.85 ± 0.31 +2.0%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 tg128 52.70 ± 0.29 52.20 ± 0.04 -0.9%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 289.29 ± 0.30 294.30 ± 0.19 +1.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 tg128 50.85 ± 0.09 50.50 ± 0.03 -0.7%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 290.47 ± 0.22 295.79 ± 1.78 +1.8%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 73.39 ± 0.16 73.35 ± 0.10 -0.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 279.92 ± 0.21 283.10 ± 0.12 +1.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 70.09 ± 0.07 70.27 ± 0.07 +0.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 336.16 ± 0.33 339.18 ± 0.71 +0.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 75.25 ± 0.29 75.26 ± 0.08 +0.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 321.76 ± 0.26 326.49 ± 0.83 +1.5%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 71.64 ± 0.09 71.64 ± 0.07 +0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 338.00 ± 0.24 341.55 ± 0.77 +1.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 67.96 ± 0.07 67.98 ± 0.08 +0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 324.30 ± 0.15 326.77 ± 0.21 +0.8%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 65.09 ± 0.06 65.12 ± 0.14 +0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 330.27 ± 0.30 334.57 ± 0.41 +1.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 59.72 ± 0.07 59.70 ± 0.07 -0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 316.34 ± 0.18 320.76 ± 0.20 +1.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 57.59 ± 0.13 57.51 ± 0.07 -0.1%

fp32 only:

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 453.65 ± 2.42 465.19 ± 0.65 +2.5%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 tg128 82.31 ± 0.29 82.27 ± 0.17 -0.0%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 426.52 ± 0.54 437.85 ± 0.70 +2.7%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 tg128 76.91 ± 0.05 77.63 ± 0.15 +0.9%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 407.66 ± 0.92 422.86 ± 0.83 +3.7%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 tg128 69.88 ± 0.06 70.44 ± 0.13 +0.8%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 386.83 ± 1.02 401.07 ± 0.23 +3.7%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 tg128 66.58 ± 0.14 66.93 ± 0.05 +0.5%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 417.37 ± 0.34 428.36 ± 2.82 +2.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 tg128 51.35 ± 0.05 51.74 ± 0.05 +0.8%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 392.04 ± 0.78 401.79 ± 1.51 +2.5%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 tg128 49.60 ± 0.08 50.01 ± 0.07 +0.8%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 391.88 ± 0.41 404.59 ± 0.73 +3.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 71.91 ± 0.20 72.84 ± 0.12 +1.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 374.42 ± 0.40 382.74 ± 0.23 +2.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 68.60 ± 0.21 69.55 ± 0.08 +1.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 467.25 ± 0.18 480.87 ± 0.44 +2.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 74.20 ± 0.17 74.63 ± 0.12 +0.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 437.94 ± 0.16 447.99 ± 0.51 +2.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 70.34 ± 0.22 70.55 ± 0.19 +0.3%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 467.31 ± 0.54 481.17 ± 0.46 +3.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 67.30 ± 0.08 67.23 ± 0.13 -0.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 439.78 ± 0.34 449.46 ± 0.25 +2.2%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 64.36 ± 0.10 63.88 ± 0.19 -0.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 454.17 ± 0.24 469.37 ± 0.32 +3.3%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 58.73 ± 0.21 58.24 ± 0.11 -0.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 426.97 ± 0.49 437.85 ± 0.43 +2.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 56.23 ± 0.12 55.94 ± 0.07 -0.5%
AMD Radeon RX 6800 XT
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 767.64 ± 0.58 747.48 ± 0.32 -2.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 93.12 ± 0.83 93.41 ± 0.01 +0.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 748.10 ± 0.34 731.19 ± 0.34 -2.3%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 88.36 ± 0.02 88.23 ± 0.01 -0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1666.38 ± 1.89 1655.20 ± 1.68 -0.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 92.68 ± 0.03 92.67 ± 0.01 -0.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1581.70 ± 0.81 1579.72 ± 0.73 -0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 87.69 ± 0.02 87.67 ± 0.01 -0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1622.83 ± 1.29 1613.93 ± 2.53 -0.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 85.22 ± 0.01 85.23 ± 0.01 +0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1543.40 ± 0.72 1541.32 ± 1.24 -0.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 80.58 ± 0.02 80.59 ± 0.01 +0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1399.66 ± 1.58 1392.71 ± 1.31 -0.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 56.55 ± 0.00 56.55 ± 0.00 +0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1339.90 ± 0.28 1338.06 ± 0.67 -0.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 53.81 ± 0.01 53.78 ± 0.01 -0.1%

without integer dot:

model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 764.87 ± 0.84 750.44 ± 0.67 -1.9%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 93.43 ± 0.01 93.45 ± 0.01 +0.0%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 745.07 ± 0.31 732.19 ± 0.32 -1.7%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 88.14 ± 0.01 88.16 ± 0.01 +0.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 951.68 ± 0.65 950.55 ± 0.71 -0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 94.56 ± 0.04 94.59 ± 0.01 +0.0%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 924.07 ± 0.24 922.76 ± 1.03 -0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 89.46 ± 0.00 89.44 ± 0.02 -0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 965.97 ± 0.70 959.26 ± 1.08 -0.7%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 79.54 ± 0.01 79.56 ± 0.01 +0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 937.70 ± 0.78 931.57 ± 0.44 -0.7%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 75.18 ± 0.02 75.28 ± 0.01 +0.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 930.01 ± 0.92 916.18 ± 0.79 -1.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 56.55 ± 0.01 56.54 ± 0.01 -0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 902.92 ± 0.48 890.01 ± 0.36 -1.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 53.81 ± 0.01 53.79 ± 0.01 -0.0%
Intel A770
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 102.71 ± 0.07 302.83 ± 0.49 +194.8%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 tg128 32.75 ± 0.02 32.79 ± 0.02 +0.1%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 78.95 ± 0.10 104.01 ± 0.05 +31.7%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 tg128 25.36 ± 0.05 25.38 ± 0.00 +0.1%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 99.17 ± 0.07 234.19 ± 0.41 +136.2%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 tg128 18.40 ± 0.01 18.45 ± 0.01 +0.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 76.45 ± 0.07 116.61 ± 0.06 +52.5%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 tg128 15.89 ± 0.02 15.92 ± 0.03 +0.2%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 99.72 ± 0.10 233.54 ± 1.02 +134.2%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 tg128 16.17 ± 0.01 16.19 ± 0.03 +0.1%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 76.37 ± 0.12 107.72 ± 0.07 +41.1%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 tg128 14.36 ± 0.00 14.38 ± 0.00 +0.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 98.99 ± 0.17 229.17 ± 0.46 +131.5%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 tg128 37.99 ± 0.03 38.06 ± 0.02 +0.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 77.42 ± 0.07 105.78 ± 0.09 +36.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 tg128 28.11 ± 0.01 28.08 ± 0.04 -0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 830.11 ± 1.23 904.34 ± 1.77 +8.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 tg128 38.91 ± 0.04 38.98 ± 0.01 +0.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 252.63 ± 0.11 252.76 ± 0.08 +0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 tg128 28.87 ± 0.00 28.88 ± 0.02 +0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 828.89 ± 1.68 900.07 ± 1.16 +8.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 tg128 44.91 ± 0.19 44.95 ± 0.05 +0.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 252.75 ± 0.09 252.59 ± 0.22 -0.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 tg128 31.70 ± 0.02 31.74 ± 0.01 +0.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 742.66 ± 1.56 797.29 ± 1.28 +7.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 tg128 35.37 ± 0.02 35.42 ± 0.01 +0.1%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 243.30 ± 0.12 243.36 ± 0.11 +0.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 tg128 26.82 ± 0.03 26.87 ± 0.01 +0.2%

In summary, positive for Nvidia coopmat1, slightly positive for AMD Vega20, slightly negative for AMD RDNA2, and very positive for Intel Alchemist.

@netrunnereve (Collaborator) commented:

Here's a quick test on my 470; it runs slightly faster or slower depending on the quant used. From looking at the changes it shouldn't make a difference, but I guess the driver is compiling things a bit differently.

As for your results, the only numbers that really matter are the non-dp4a pp512 ones. Something is really, really wrong with those Nvidia runs, and I have a feeling that this might have terrible results on a Maxwell or Pascal chip. Meanwhile, for AMD, did you check whether you're using the V_DOT2_F32_F16 instruction? At least with RGP I've never been able to get it to generate that, even if I use dot() on two fp16 vectors.

Intel's good, so that's good, and it looks like it's now using the dot product instructions.

PR:

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 185.77 ± 1.52
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 38.76 ± 0.08
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    26 runs - 41540.58 us/run -  60.13 GFLOP/run -   1.45 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    40 runs - 25755.83 us/run -  60.13 GFLOP/run -   2.33 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   38 runs - 26883.37 us/run -  60.13 GFLOP/run -   2.24 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 22990.64 us/run -  60.13 GFLOP/run -   2.62 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 22889.14 us/run -  60.13 GFLOP/run -   2.63 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24746.45 us/run -  60.13 GFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24276.55 us/run -  60.13 GFLOP/run -   2.48 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24019.43 us/run -  60.13 GFLOP/run -   2.50 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  46 runs - 22668.83 us/run -  60.13 GFLOP/run -   2.65 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24327.69 us/run -  60.13 GFLOP/run -   2.47 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   38 runs - 26900.53 us/run -  60.13 GFLOP/run -   2.24 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 27928.81 us/run -  60.13 GFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   34 runs - 30028.00 us/run -  60.13 GFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 28748.83 us/run -  60.13 GFLOP/run -   2.09 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                44 runs - 23652.18 us/run -  60.13 GFLOP/run -   2.54 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 44 runs - 23300.91 us/run -  60.13 GFLOP/run -   2.58 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  38 runs - 27562.39 us/run -  60.13 GFLOP/run -   2.18 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                42 runs - 24565.81 us/run -  60.13 GFLOP/run -   2.45 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  44 runs - 23326.89 us/run -  60.13 GFLOP/run -   2.58 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  42 runs - 23996.62 us/run -  60.13 GFLOP/run -   2.51 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 44 runs - 23616.02 us/run -  60.13 GFLOP/run -   2.55 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  42 runs - 24867.29 us/run -  60.13 GFLOP/run -   2.42 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 38 runs - 26893.34 us/run -  60.13 GFLOP/run -   2.24 TFLOPS

Master:

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 pp512 191.66 ± 1.13
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan 100 tg128 38.23 ± 0.15
  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    24 runs - 42105.12 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    38 runs - 26393.45 us/run -  60.13 GFLOP/run -   2.28 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 28066.75 us/run -  60.13 GFLOP/run -   2.14 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 23684.27 us/run -  60.13 GFLOP/run -   2.54 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 23501.84 us/run -  60.13 GFLOP/run -   2.56 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   40 runs - 25421.35 us/run -  60.13 GFLOP/run -   2.37 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24880.21 us/run -  60.13 GFLOP/run -   2.42 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24610.71 us/run -  60.13 GFLOP/run -   2.44 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  44 runs - 23247.95 us/run -  60.13 GFLOP/run -   2.59 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   40 runs - 25229.40 us/run -  60.13 GFLOP/run -   2.38 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   38 runs - 27730.39 us/run -  60.13 GFLOP/run -   2.17 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 28658.36 us/run -  60.13 GFLOP/run -   2.10 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   34 runs - 30667.38 us/run -  60.13 GFLOP/run -   1.96 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 29375.75 us/run -  60.13 GFLOP/run -   2.05 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                42 runs - 24167.88 us/run -  60.13 GFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 42 runs - 24162.33 us/run -  60.13 GFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  36 runs - 28005.72 us/run -  60.13 GFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                42 runs - 24828.86 us/run -  60.13 GFLOP/run -   2.42 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  44 runs - 23753.23 us/run -  60.13 GFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  42 runs - 24208.48 us/run -  60.13 GFLOP/run -   2.48 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 42 runs - 24274.55 us/run -  60.13 GFLOP/run -   2.48 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  40 runs - 25486.95 us/run -  60.13 GFLOP/run -   2.36 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 38 runs - 27545.03 us/run -  60.13 GFLOP/run -   2.18 TFLOPS

@0cc4m (Collaborator, Author) commented Sep 19, 2025

@netrunnereve Thank you for the test. Yeah, I'm still torn on this. It's good for Intel and coopmat1, but slightly negative for other cases, very negative on Nvidia and on Apple. I'll put it back on draft for now and try to figure out whether there's a way to improve this.

I did not test it with RGP, so maybe it's not actually hitting the instructions. But I'd have to check with RADV, not sure how that works.

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

@0cc4m marked this pull request as draft September 19, 2025 06:24
@jeffbolznv (Collaborator) commented:

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

I didn't see anything obvious in the change, so I looked at the sass we generate. I think the regression is due to the use of the dot intrinsic: our compiler expands it into componentwise mul+add and has trouble re-vectorizing that into paired fma instructions.

I think shaderFloat16 is not enabled on most pre-Turing devices (for performance reasons) so they would probably not be affected. But a few, like Tesla P100(?) presumably would.

@0cc4m (Collaborator, Author) commented Sep 19, 2025

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

I didn't see anything obvious in the change, so I looked at the sass we generate. I think the regression is due to the use of the dot intrinsic: our compiler expands it into componentwise mul+add and has trouble re-vectorizing that into paired fma instructions.

That's odd; I used dot specifically to create a situation where there are two independent multiplies/adds that should be easy to fuse into one of the dot or fma instructions that many GPU architectures have. I'll have to look at the AMD assembly to see if it worked there.

@jeffbolznv (Collaborator) commented:

Yeah, this is our compiler not doing a great job. It might work better to do fmas on the vectors and then combine the two components at the end.
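
In GLSL terms, that suggestion could look roughly like the following sketch of the shader's inner k-loop (the loop bound K_STEPS and the flat cache indexing are hypothetical; the fix that was actually applied in the diff below uses per-component scalar fmas instead):

// Sketch only: accumulate both 16-bit lanes in parallel with vector fmas,
// then fold the two lanes into the scalar accumulator once at the end.
ACC_TYPE_VEC2 sum_v = ACC_TYPE_VEC2(0.0);
[[unroll]] for (uint i = 0; i < K_STEPS; i++) {
    sum_v = fma(ACC_TYPE_VEC2(cache_a[i]), ACC_TYPE_VEC2(cache_b[i]), sum_v);
}
sums[sums_idx] += sum_v.x + sum_v.y;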

@0cc4m (Collaborator, Author) commented Sep 19, 2025

@jeffbolznv You are correct, thank you. This diff fixes the performance issue on Nvidia and Apple:

diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
index d22230fad..38a4d07d0 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
@@ -357,7 +357,7 @@ void main() {
                     [[unroll]] for (uint cc = 0; cc < TN; cc++) {
                         [[unroll]] for (uint cr = 0; cr < TM; cr++) {
                             const uint sums_idx = (wsic * TN + cc) * (WMITER * TM) + wsir * TM + cr;
-                            sums[sums_idx] += dot(ACC_TYPE_VEC2(cache_a[wsir * TM + cr]), ACC_TYPE_VEC2(cache_b[cc]));
+                            sums[sums_idx] = fma(ACC_TYPE(cache_a[wsir * TM + cr].x), ACC_TYPE(cache_b[cc].x), fma(ACC_TYPE(cache_a[wsir * TM + cr].y), ACC_TYPE(cache_b[cc].y), sums[sums_idx]));
                         }
                     }
                 }

@0cc4m (Collaborator, Author) commented Sep 19, 2025

Here are updated results:

Nvidia RTX 3090 without coopmat or integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 1127.73 ± 2.10 1244.38 ± 4.58 +10.3%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 1113.12 ± 2.21 1228.22 ± 3.10 +10.3%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 1059.88 ± 5.29 1183.09 ± 2.50 +11.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 1042.59 ± 3.66 1168.04 ± 2.60 +12.0%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 985.79 ± 3.83 1089.85 ± 4.18 +10.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 974.48 ± 1.13 1073.09 ± 7.68 +10.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 978.75 ± 3.61 1091.84 ± 5.65 +11.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 970.51 ± 0.57 1066.46 ± 5.65 +9.9%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 1131.58 ± 5.90 1215.66 ± 15.93 +7.4%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 1113.63 ± 1.68 1188.07 ± 10.87 +6.7%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 1100.51 ± 3.50 1209.61 ± 8.40 +9.9%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 1085.11 ± 2.51 1179.40 ± 5.16 +8.7%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 1092.64 ± 4.17 1205.93 ± 3.23 +10.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 1079.56 ± 2.01 1180.21 ± 11.35 +9.3%
AMD Radeon Pro VII without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 327.01 ± 0.32 333.32 ± 1.92 +1.9%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 312.49 ± 0.24 317.96 ± 0.36 +1.8%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 314.93 ± 0.19 323.17 ± 0.87 +2.6%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 302.94 ± 0.09 310.66 ± 0.48 +2.5%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 303.29 ± 0.71 305.22 ± 0.72 +0.6%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 291.26 ± 0.11 293.29 ± 0.43 +0.7%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 293.15 ± 0.32 293.67 ± 1.06 +0.2%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 281.64 ± 0.20 283.24 ± 0.56 +0.6%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 338.81 ± 1.06 339.25 ± 1.85 +0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 323.43 ± 0.18 325.40 ± 0.65 +0.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 341.30 ± 0.24 340.93 ± 1.42 -0.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 326.30 ± 0.29 326.85 ± 0.86 +0.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 332.56 ± 0.26 335.34 ± 1.45 +0.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 317.07 ± 0.30 320.66 ± 0.77 +1.1%
AMD Radeon Pro VII without integer dot or fp16
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 454.79 ± 2.11 468.57 ± 1.22 +3.0%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 427.14 ± 1.50 439.25 ± 0.59 +2.8%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 410.62 ± 2.13 423.55 ± 1.34 +3.1%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 388.88 ± 0.76 401.95 ± 0.55 +3.4%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 417.58 ± 2.28 423.92 ± 0.52 +1.5%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 395.26 ± 1.12 401.45 ± 0.44 +1.6%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 397.12 ± 2.83 400.94 ± 1.53 +1.0%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 376.35 ± 0.71 381.93 ± 0.30 +1.5%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 471.54 ± 2.17 477.35 ± 0.45 +1.2%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 440.95 ± 1.48 448.43 ± 0.63 +1.7%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 471.33 ± 1.64 478.87 ± 0.47 +1.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 442.25 ± 1.09 450.15 ± 0.23 +1.8%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 455.63 ± 0.36 467.57 ± 0.85 +2.6%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 428.45 ± 0.27 439.68 ± 0.32 +2.6%
AMD Radeon RX 6800 XT without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 935.47 ± 0.58 933.18 ± 0.23 -0.2%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 908.75 ± 0.53 906.94 ± 0.57 -0.2%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 915.66 ± 0.70 907.38 ± 0.93 -0.9%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 889.68 ± 0.22 883.55 ± 0.13 -0.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 799.45 ± 0.50 792.41 ± 0.35 -0.9%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 780.09 ± 0.14 774.30 ± 0.22 -0.7%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 761.47 ± 0.44 747.38 ± 0.28 -1.9%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 744.30 ± 0.25 731.15 ± 0.25 -1.8%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 951.67 ± 0.59 950.34 ± 0.55 -0.1%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 924.76 ± 0.40 924.53 ± 0.28 -0.0%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 966.97 ± 0.59 961.14 ± 0.37 -0.6%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 939.43 ± 0.68 934.73 ± 0.27 -0.5%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 930.58 ± 0.57 917.98 ± 0.62 -1.4%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 905.20 ± 0.37 893.14 ± 0.15 -1.3%
Intel A770 without integer dot
model size params backend ngl fa test t/s (before) t/s (after) diff
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 0 pp512 102.81 ± 0.07 301.44 ± 0.85 +193.2%
llama 8B IQ1_S - 1.5625 bpw 1.87 GiB 8.03 B Vulkan 99 1 pp512 78.92 ± 0.11 103.90 ± 0.04 +31.7%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 0 pp512 99.07 ± 0.08 234.28 ± 0.17 +136.5%
llama 8B IQ2_M - 2.7 bpw 2.74 GiB 8.03 B Vulkan 99 1 pp512 76.43 ± 0.05 116.72 ± 0.04 +52.7%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 0 pp512 99.56 ± 0.12 234.16 ± 0.24 +135.2%
llama 8B IQ4_XS - 4.25 bpw 4.13 GiB 8.03 B Vulkan 99 1 pp512 76.43 ± 0.05 107.81 ± 0.13 +41.1%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 0 pp512 98.93 ± 0.09 229.02 ± 0.51 +131.5%
llama 8B Q4_K - Small 4.36 GiB 8.03 B Vulkan 99 1 pp512 77.38 ± 0.04 105.77 ± 0.06 +36.7%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 0 pp512 104.65 ± 0.02 291.24 ± 0.37 +178.3%
llama 8B Q4_0 4.33 GiB 8.03 B Vulkan 99 1 pp512 79.63 ± 0.04 121.13 ± 0.07 +52.1%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 0 pp512 105.08 ± 0.10 290.53 ± 0.29 +176.5%
llama 8B Q4_1 4.77 GiB 8.03 B Vulkan 99 1 pp512 80.50 ± 0.11 121.54 ± 0.04 +51.0%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 0 pp512 99.65 ± 0.13 268.29 ± 0.56 +169.2%
llama 8B Q8_0 7.95 GiB 8.03 B Vulkan 99 1 pp512 76.89 ± 0.13 118.52 ± 0.02 +54.1%

@netrunnereve I think you are right: I see lots of v_fma_f16 in the Vega20 disassembly, but no v_pk_fma_f16, only a few cases of v_pk_mul_f16. Same with RDNA2. I'm not sure how I would trigger that or why it isn't triggering.

@netrunnereve (Collaborator) commented Sep 19, 2025

I did not test it with RGP, so maybe it's not actually hitting the instructions. But I'd have to check with RADV, not sure how that works.

I think you figured this out already, but for RADV you can use RADV_DEBUG=shaders to get the assembly. To make sure you're getting the correct shader, run it with test-backend-ops on a single test case only.

@netrunnereve I think you are right: I see lots of v_fma_f16 in the Vega20 disassembly, but no v_pk_fma_f16, only a few cases of v_pk_mul_f16. Same with RDNA2. I'm not sure how I would trigger that or why it isn't triggering.

Even a packed multiply or fma isn't good, as that only does the calculation in parallel for the upper and lower 16 bits, and you still need additional instructions to add the halves together and convert them to fp32. V_DOT2_F32_F16, on the other hand, does the entire dot product, the fp32 conversion, and a free addition with the previous sum, all in one cycle.

Honestly this is something worthy of a Mesa ticket (I don't care about amdvlk anymore, as AMD got rid of it), and I doubt they've put a lot of effort into this, since graphics apps mostly use fp32. Since RGP uses amdvlk in the background, it looks like AMD hasn't looked into this use case either. Maybe there should be a shaderfloatingpointdotproduct Vulkan extension for this.

@0cc4m (Collaborator, Author) commented Sep 19, 2025

Yeah, I looked for dot first, but I couldn't trigger any dot function on either GPU. At least a few of the packed ones triggered, but fma would have been more interesting than a few quant-specific packed muls.

I can trigger dot2 on RDNA 4 with amdvlk/RGP using this source code:

#version 450

#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout (binding = 0) readonly buffer A {f16vec2 data_a[];};
layout (binding = 1) writeonly buffer B {float data_b[];};

void main()
{
    uint index = gl_GlobalInvocationID.x;

    const f16vec2 val0 = data_a[index * 2];
    const f16vec2 val1 = data_a[index * 2 + 1];

    data_b[index] = dot(val0, val1);
}

For RDNA 4 this gives me:

22	 0x00008C	     v_dot2_f16_f16   	 v1.l,  v2,  v1,  0	 3,12	 
23	 0x000094	     v_cvt_f32_f16_e32	 v1,  v1.l         	 2,12	 

but for older generations I get at most a packed mul:

18	 0x000068	     v_pk_mul_f16     	 v1,  v1,  v2                                                                   	 3,8	 
19	 0x000070	     v_add_f16_sdwa   	 v1,  v1,  v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1	 2,8	 
20	 0x000078	     v_cvt_f32_f16_e32	 v1,  v1                                                                        	 2,8	 

With RADV, for Vega20 I just get this:

        v_mul_f16_sdwa v1, s2, v0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:DWORD ; 440200f9 06851402
        v_fma_f16 v1, s2, v2, v1                                    ; d2060001 04060402
        v_cvt_f32_f16_e32 v3, v1                                    ; 7e061701

Yeah, I think we should open an issue with Mesa. I don't think an extension would make it through, since it's basically just up to the compiler to spot the opportunity to use it.

@0cc4m marked this pull request as ready for review September 19, 2025 18:44
@0cc4m requested a review from jeffbolznv September 19, 2025 18:45
@jeffbolznv (Collaborator) left a comment:

I verified the latest commit is generating good code.

@netrunnereve (Collaborator) left a comment:

For RDNA 4 this gives me:

And here they waste a clock cycle on the "conversion" when there's an instruction that already does that. Also, v_dot2_f16_f16 is not an RDNA 4 exclusive, so it looks like there's some different compiler logic for that chip.

Yeah, I think we should open an issue with Mesa. I don't think an extension would make it through, since it's basically just up to the compiler to spot the opportunity to use it.

I mean, the SPIR-V literally contains a dot instruction, and the compiler should recognise that hey, this is an fp16 dot product and we have an instruction for it! The only thing an extension would do is force the use of the instruction and also indicate that the GPU has special hardware support for dot products.

I've been thinking of getting an MI25 or MI50 once I figure out whether it's possible to cut a hole in the cover and place a fan inside. Adding those mini fans to the front is going to be loud, and my computer doesn't have the space for it. If I manage to get that going, I'll probably start looking into how Mesa handles the dot products internally.

Anyways I think this is good to merge now.

@0cc4m merged commit 803dac2 into master Sep 20, 2025
92 of 94 checks passed
@0cc4m deleted the 0cc4m/vulkan-mm-use-vec-dot branch September 20, 2025 08:43
@wbruna (Contributor) commented Sep 21, 2025

This change seems to have broken prompt adherence on stable-diffusion.cpp: leejet/stable-diffusion.cpp#847.

@etasnadi (Contributor) commented:

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

I didn't see anything obvious in the change, so I looked at the sass we generate. I think the regression is due to the use of the dot intrinsic: our compiler expands it into componentwise mul+add and has trouble re-vectorizing that into paired fma instructions.

I think shaderFloat16 is not enabled on most pre-Turing devices (for performance reasons) so they would probably not be affected. But a few, like Tesla P100(?) presumably would.

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

@0cc4m (Collaborator, Author) commented Sep 22, 2025

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

@etasnadi (Contributor) commented:

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

Sure, I am interested not only because it would make profiling easier, but it would also be nice if we could inject any sass code into the shader compiled with nvcc.

@0cc4m (Collaborator, Author) commented Sep 22, 2025

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

Sure, I am interested not only because it would make profiling easier, but it would also be nice if we could inject any sass code into the shader compiled with nvcc.

I've asked this as well: #15363 (comment)

@etasnadi (Contributor) commented:

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

Sure, I am interested not only because it would make profiling easier, but it would also be nice if we could inject any sass code into the shader compiled with nvcc.

I've asked this as well: #15363 (comment)

Thanks, I didn't know you could profile compute shaders with Nsight at the instruction level (e.g. bank conflicts, register usage, etc.).

struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions

* use fma instead of dot to fix Nvidia and Apple performance issues
@netrunnereve (Collaborator) commented Sep 28, 2025

FYI, I took a quick look on the ACO side and it turns out they don't have special support for FP16 dot products. They do have packed fmas and multiplies though, and surprisingly they do support BF16 dot products on RDNA 3 and 4. Of course there's no point in using that in our case, as those chips have coopmat.

There's actually a Mesa PR for this but it got closed.

@0cc4m (Collaborator, Author) commented Sep 29, 2025

Thank you for checking. Can you open a RADV issue for it? They closed that PR because it didn't bring benefits for games, but I think for us it would be helpful.

@netrunnereve (Collaborator) commented:

Sure, if you provide the SPIR-V and the Mesa assembly dump for your card. Please generate that with the dot() function, as SPIR-V has an instruction for that and that's what the compiler will be looking for; I don't think it can recognise the fmas you're using now.

Otherwise I'll deal with this once I get my MI card 😁.

@0cc4m (Collaborator, Author) commented Oct 2, 2025

I talked briefly with the author of that Mesa PR, and basically there are precision issues with the dot2 instructions and different unexpected behaviour across the AMD generations implementing them; that's why they don't want to use them.
