vulkan: use vec dot for matrix matrix multiplications #16056

0cc4m · 2025-09-17T16:34:17Z

This PR changes the shared memory and register caching to use vec2 instead of scalars. Initially this was to enable vec2 dot instructions for the accumulations, but I think it also helps with caching because accessing 32-bit values is more efficient than accessing 16-bit values.

It needs a few more registers because it loads 2 k-values from shared memory into registers instead of just 1.

…o use vec2 instead of scalars, to enable using dot2 instructions

0cc4m · 2025-09-17T17:05:19Z

Nvidia RTX 3090

Coopmat1:

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	3698.65 ± 20.83	3859.05 ± 22.75	+4.3%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	tg128	78.84 ± 0.35	78.35 ± 0.33	-0.6%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	3689.59 ± 19.70	3838.88 ± 6.68	+4.0%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	tg128	79.08 ± 0.23	77.80 ± 0.16	-1.6%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	2983.33 ± 17.28	3202.46 ± 3.99	+7.3%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	tg128	98.77 ± 0.38	96.82 ± 0.44	-2.0%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	2975.39 ± 8.25	3199.13 ± 5.71	+7.5%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	tg128	98.06 ± 0.25	95.85 ± 0.26	-2.3%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	2757.56 ± 5.44	2855.59 ± 29.35	+3.6%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	tg128	85.73 ± 0.31	84.27 ± 0.17	-1.7%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	2740.31 ± 8.39	2806.00 ± 25.21	+2.4%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	tg128	85.17 ± 0.27	83.56 ± 0.33	-1.9%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	2602.50 ± 22.48	2601.89 ± 24.45	-0.0%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	122.45 ± 0.48	122.74 ± 0.58	+0.2%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	2579.82 ± 19.50	2494.23 ± 33.51	-3.3%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	118.14 ± 0.46	117.08 ± 0.18	-0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	3771.03 ± 9.28	3852.97 ± 29.60	+2.2%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	143.73 ± 0.64	145.19 ± 0.67	+1.0%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	3740.55 ± 8.87	3798.49 ± 30.75	+1.5%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	139.63 ± 0.23	142.92 ± 0.47	+2.4%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	3600.01 ± 19.22	3599.48 ± 8.28	-0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	136.24 ± 0.34	137.31 ± 0.29	+0.8%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	3540.87 ± 7.42	3565.16 ± 18.22	+0.7%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	131.89 ± 0.27	133.46 ± 1.74	+1.2%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	3141.03 ± 88.98	3147.74 ± 8.73	+0.2%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	91.56 ± 0.22	91.78 ± 0.24	+0.2%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	3155.82 ± 4.68	3128.18 ± 28.61	-0.9%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	91.04 ± 0.22	91.17 ± 0.11	+0.1%

fp16 only, no coopmat or integer dot:

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	1119.67 ± 2.49	752.08 ± 1.08	-32.8%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	tg128	79.28 ± 0.25	78.06 ± 0.18	-1.5%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	1100.23 ± 1.31	726.70 ± 9.23	-34.0%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	tg128	77.58 ± 0.16	76.80 ± 0.10	-1.0%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	1044.76 ± 2.95	709.27 ± 9.92	-32.1%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	tg128	98.11 ± 0.10	97.28 ± 0.15	-0.8%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	1018.82 ± 7.00	684.62 ± 1.07	-32.8%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	tg128	95.57 ± 0.21	95.07 ± 0.21	-0.5%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	959.25 ± 11.63	658.00 ± 10.11	-31.4%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	tg128	84.40 ± 0.27	83.88 ± 0.27	-0.6%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	911.18 ± 3.10	638.61 ± 5.44	-29.9%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	tg128	83.23 ± 0.05	82.69 ± 0.29	-0.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	932.01 ± 16.49	623.06 ± 14.78	-33.1%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	121.33 ± 0.36	120.42 ± 0.64	-0.8%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	900.32 ± 9.58	619.17 ± 4.98	-31.2%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	114.76 ± 0.69	113.73 ± 0.10	-0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	1076.98 ± 31.73	706.16 ± 9.69	-34.4%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	136.72 ± 0.38	134.50 ± 1.09	-1.6%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	1016.89 ± 37.52	689.40 ± 3.62	-32.2%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	130.99 ± 0.25	131.13 ± 0.11	+0.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	1052.20 ± 25.71	701.55 ± 10.18	-33.3%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	126.23 ± 0.45	126.20 ± 0.34	-0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	1002.47 ± 6.13	681.43 ± 6.17	-32.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	122.55 ± 0.22	122.88 ± 0.27	+0.3%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	1044.00 ± 20.93	705.14 ± 10.96	-32.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	91.51 ± 0.16	91.40 ± 0.16	-0.1%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	978.13 ± 10.50	679.43 ± 5.04	-30.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	89.77 ± 0.10	90.40 ± 0.61	+0.7%

AMD Radeon Pro VII

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	329.39 ± 0.45	332.38 ± 0.22	+0.9%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	tg128	83.10 ± 0.39	82.44 ± 0.16	-0.8%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	313.21 ± 0.25	316.30 ± 0.33	+1.0%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	tg128	78.20 ± 0.50	77.65 ± 0.17	-0.7%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	313.72 ± 0.26	324.10 ± 0.32	+3.3%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	tg128	70.35 ± 0.08	70.40 ± 0.12	+0.1%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	301.52 ± 0.17	310.86 ± 0.18	+3.1%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	tg128	67.19 ± 0.08	66.92 ± 0.24	-0.4%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	301.87 ± 0.17	305.47 ± 0.15	+1.2%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	tg128	52.41 ± 0.19	51.68 ± 0.02	-1.4%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	289.62 ± 0.25	291.86 ± 0.77	+0.8%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	tg128	50.57 ± 0.10	49.91 ± 0.16	-1.3%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	291.44 ± 0.30	293.55 ± 0.40	+0.7%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	71.63 ± 1.49	72.07 ± 0.21	+0.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	279.90 ± 0.09	281.62 ± 0.36	+0.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	69.85 ± 0.16	68.78 ± 0.44	-1.5%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	738.77 ± 0.35	723.68 ± 5.41	-2.0%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	98.54 ± 0.46	93.41 ± 1.09	-5.2%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	667.78 ± 0.57	657.61 ± 2.00	-1.5%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	91.73 ± 0.63	86.98 ± 0.33	-5.2%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	644.67 ± 0.27	641.24 ± 0.95	-0.5%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	102.06 ± 0.77	96.76 ± 0.26	-5.2%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	590.17 ± 0.46	581.79 ± 0.55	-1.4%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	93.57 ± 0.49	89.83 ± 0.36	-4.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	647.84 ± 0.44	639.25 ± 4.52	-1.3%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	73.12 ± 0.14	68.81 ± 0.07	-5.9%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	591.01 ± 0.49	584.45 ± 0.11	-1.1%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	66.84 ± 1.17	65.50 ± 0.03	-2.0%

without integer dot:

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	326.17 ± 0.69	331.83 ± 0.47	+1.7%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	tg128	82.59 ± 0.19	82.84 ± 0.17	+0.3%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	309.99 ± 0.15	314.68 ± 0.53	+1.5%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	tg128	77.47 ± 0.12	77.83 ± 0.29	+0.5%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	310.69 ± 0.17	322.13 ± 0.23	+3.7%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	tg128	70.41 ± 0.12	70.65 ± 0.06	+0.3%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	300.37 ± 0.17	311.01 ± 0.20	+3.5%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	tg128	67.12 ± 0.08	67.38 ± 0.04	+0.4%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	299.74 ± 1.32	305.85 ± 0.31	+2.0%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	tg128	52.70 ± 0.29	52.20 ± 0.04	-0.9%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	289.29 ± 0.30	294.30 ± 0.19	+1.7%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	tg128	50.85 ± 0.09	50.50 ± 0.03	-0.7%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	290.47 ± 0.22	295.79 ± 1.78	+1.8%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	73.39 ± 0.16	73.35 ± 0.10	-0.1%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	279.92 ± 0.21	283.10 ± 0.12	+1.1%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	70.09 ± 0.07	70.27 ± 0.07	+0.3%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	336.16 ± 0.33	339.18 ± 0.71	+0.9%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	75.25 ± 0.29	75.26 ± 0.08	+0.0%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	321.76 ± 0.26	326.49 ± 0.83	+1.5%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	71.64 ± 0.09	71.64 ± 0.07	+0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	338.00 ± 0.24	341.55 ± 0.77	+1.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	67.96 ± 0.07	67.98 ± 0.08	+0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	324.30 ± 0.15	326.77 ± 0.21	+0.8%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	65.09 ± 0.06	65.12 ± 0.14	+0.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	330.27 ± 0.30	334.57 ± 0.41	+1.3%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	59.72 ± 0.07	59.70 ± 0.07	-0.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	316.34 ± 0.18	320.76 ± 0.20	+1.4%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	57.59 ± 0.13	57.51 ± 0.07	-0.1%

fp32 only:

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	453.65 ± 2.42	465.19 ± 0.65	+2.5%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	tg128	82.31 ± 0.29	82.27 ± 0.17	-0.0%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	426.52 ± 0.54	437.85 ± 0.70	+2.7%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	tg128	76.91 ± 0.05	77.63 ± 0.15	+0.9%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	407.66 ± 0.92	422.86 ± 0.83	+3.7%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	tg128	69.88 ± 0.06	70.44 ± 0.13	+0.8%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	386.83 ± 1.02	401.07 ± 0.23	+3.7%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	tg128	66.58 ± 0.14	66.93 ± 0.05	+0.5%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	417.37 ± 0.34	428.36 ± 2.82	+2.6%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	tg128	51.35 ± 0.05	51.74 ± 0.05	+0.8%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	392.04 ± 0.78	401.79 ± 1.51	+2.5%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	tg128	49.60 ± 0.08	50.01 ± 0.07	+0.8%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	391.88 ± 0.41	404.59 ± 0.73	+3.2%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	71.91 ± 0.20	72.84 ± 0.12	+1.3%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	374.42 ± 0.40	382.74 ± 0.23	+2.2%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	68.60 ± 0.21	69.55 ± 0.08	+1.4%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	467.25 ± 0.18	480.87 ± 0.44	+2.9%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	74.20 ± 0.17	74.63 ± 0.12	+0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	437.94 ± 0.16	447.99 ± 0.51	+2.3%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	70.34 ± 0.22	70.55 ± 0.19	+0.3%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	467.31 ± 0.54	481.17 ± 0.46	+3.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	67.30 ± 0.08	67.23 ± 0.13	-0.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	439.78 ± 0.34	449.46 ± 0.25	+2.2%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	64.36 ± 0.10	63.88 ± 0.19	-0.7%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	454.17 ± 0.24	469.37 ± 0.32	+3.3%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	58.73 ± 0.21	58.24 ± 0.11	-0.8%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	426.97 ± 0.49	437.85 ± 0.43	+2.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	56.23 ± 0.12	55.94 ± 0.07	-0.5%

AMD Radeon RX 6800 XT

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	767.64 ± 0.58	747.48 ± 0.32	-2.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	93.12 ± 0.83	93.41 ± 0.01	+0.3%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	748.10 ± 0.34	731.19 ± 0.34	-2.3%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	88.36 ± 0.02	88.23 ± 0.01	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	1666.38 ± 1.89	1655.20 ± 1.68	-0.7%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	92.68 ± 0.03	92.67 ± 0.01	-0.0%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	1581.70 ± 0.81	1579.72 ± 0.73	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	87.69 ± 0.02	87.67 ± 0.01	-0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	1622.83 ± 1.29	1613.93 ± 2.53	-0.5%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	85.22 ± 0.01	85.23 ± 0.01	+0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	1543.40 ± 0.72	1541.32 ± 1.24	-0.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	80.58 ± 0.02	80.59 ± 0.01	+0.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	1399.66 ± 1.58	1392.71 ± 1.31	-0.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	56.55 ± 0.00	56.55 ± 0.00	+0.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	1339.90 ± 0.28	1338.06 ± 0.67	-0.1%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	53.81 ± 0.01	53.78 ± 0.01	-0.1%

without integer dot:

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	764.87 ± 0.84	750.44 ± 0.67	-1.9%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	93.43 ± 0.01	93.45 ± 0.01	+0.0%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	745.07 ± 0.31	732.19 ± 0.32	-1.7%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	88.14 ± 0.01	88.16 ± 0.01	+0.0%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	951.68 ± 0.65	950.55 ± 0.71	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	94.56 ± 0.04	94.59 ± 0.01	+0.0%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	924.07 ± 0.24	922.76 ± 1.03	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	89.46 ± 0.00	89.44 ± 0.02	-0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	965.97 ± 0.70	959.26 ± 1.08	-0.7%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	79.54 ± 0.01	79.56 ± 0.01	+0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	937.70 ± 0.78	931.57 ± 0.44	-0.7%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	75.18 ± 0.02	75.28 ± 0.01	+0.1%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	930.01 ± 0.92	916.18 ± 0.79	-1.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	56.55 ± 0.01	56.54 ± 0.01	-0.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	902.92 ± 0.48	890.01 ± 0.36	-1.4%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	53.81 ± 0.01	53.79 ± 0.01	-0.0%

Intel A770

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	102.71 ± 0.07	302.83 ± 0.49	+194.8%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	tg128	32.75 ± 0.02	32.79 ± 0.02	+0.1%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	78.95 ± 0.10	104.01 ± 0.05	+31.7%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	tg128	25.36 ± 0.05	25.38 ± 0.00	+0.1%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	99.17 ± 0.07	234.19 ± 0.41	+136.2%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	tg128	18.40 ± 0.01	18.45 ± 0.01	+0.3%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	76.45 ± 0.07	116.61 ± 0.06	+52.5%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	tg128	15.89 ± 0.02	15.92 ± 0.03	+0.2%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	99.72 ± 0.10	233.54 ± 1.02	+134.2%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	tg128	16.17 ± 0.01	16.19 ± 0.03	+0.1%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	76.37 ± 0.12	107.72 ± 0.07	+41.1%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	tg128	14.36 ± 0.00	14.38 ± 0.00	+0.1%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	98.99 ± 0.17	229.17 ± 0.46	+131.5%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	tg128	37.99 ± 0.03	38.06 ± 0.02	+0.2%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	77.42 ± 0.07	105.78 ± 0.09	+36.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	tg128	28.11 ± 0.01	28.08 ± 0.04	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	830.11 ± 1.23	904.34 ± 1.77	+8.9%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	tg128	38.91 ± 0.04	38.98 ± 0.01	+0.2%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	252.63 ± 0.11	252.76 ± 0.08	+0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	tg128	28.87 ± 0.00	28.88 ± 0.02	+0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	828.89 ± 1.68	900.07 ± 1.16	+8.6%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	tg128	44.91 ± 0.19	44.95 ± 0.05	+0.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	252.75 ± 0.09	252.59 ± 0.22	-0.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	tg128	31.70 ± 0.02	31.74 ± 0.01	+0.1%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	742.66 ± 1.56	797.29 ± 1.28	+7.4%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	tg128	35.37 ± 0.02	35.42 ± 0.01	+0.1%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	243.30 ± 0.12	243.36 ± 0.11	+0.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	tg128	26.82 ± 0.03	26.87 ± 0.01	+0.2%

In summary, positive for Nvidia coopmat1, slightly positive for AMD Vega20, slightly negative for AMD RDNA2, and very positive for Intel Alchemist.

netrunnereve · 2025-09-18T21:19:47Z

Here's a quick test on my 470, it runs slightly faster or slower depending on the quant used. From looking at the changes it shouldn't make a difference but I guess the driver's compiling things a bit differently.

As for your results the only thing that really matters are the non dp4a p512 numbers. Something is really really wrong with those Nvidia runs and I have a feeling that this might have terrible results on a Maxwell or Pascal chip. Meanwhile for AMD did you check if you're using the V_DOT2_F32_F16 instruction? At least with RGP I've never been able to get it to generate that even if I use dot() on two fp16 vectors.

Intel's good so that's good, and it looks like that it's now using the dot product instructions.

PR:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	185.77 ± 1.52
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	38.76 ± 0.08

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    26 runs - 41540.58 us/run -  60.13 GFLOP/run -   1.45 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    40 runs - 25755.83 us/run -  60.13 GFLOP/run -   2.33 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   38 runs - 26883.37 us/run -  60.13 GFLOP/run -   2.24 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 22990.64 us/run -  60.13 GFLOP/run -   2.62 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 22889.14 us/run -  60.13 GFLOP/run -   2.63 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24746.45 us/run -  60.13 GFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24276.55 us/run -  60.13 GFLOP/run -   2.48 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24019.43 us/run -  60.13 GFLOP/run -   2.50 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  46 runs - 22668.83 us/run -  60.13 GFLOP/run -   2.65 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24327.69 us/run -  60.13 GFLOP/run -   2.47 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   38 runs - 26900.53 us/run -  60.13 GFLOP/run -   2.24 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 27928.81 us/run -  60.13 GFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   34 runs - 30028.00 us/run -  60.13 GFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 28748.83 us/run -  60.13 GFLOP/run -   2.09 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                44 runs - 23652.18 us/run -  60.13 GFLOP/run -   2.54 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 44 runs - 23300.91 us/run -  60.13 GFLOP/run -   2.58 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  38 runs - 27562.39 us/run -  60.13 GFLOP/run -   2.18 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                42 runs - 24565.81 us/run -  60.13 GFLOP/run -   2.45 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  44 runs - 23326.89 us/run -  60.13 GFLOP/run -   2.58 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  42 runs - 23996.62 us/run -  60.13 GFLOP/run -   2.51 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 44 runs - 23616.02 us/run -  60.13 GFLOP/run -   2.55 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  42 runs - 24867.29 us/run -  60.13 GFLOP/run -   2.42 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 38 runs - 26893.34 us/run -  60.13 GFLOP/run -   2.24 TFLOPS

Master:

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	pp512	191.66 ± 1.13
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	tg128	38.23 ± 0.15

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    24 runs - 42105.12 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                    38 runs - 26393.45 us/run -  60.13 GFLOP/run -   2.28 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 28066.75 us/run -  60.13 GFLOP/run -   2.14 TFLOPS
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 23684.27 us/run -  60.13 GFLOP/run -   2.54 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   44 runs - 23501.84 us/run -  60.13 GFLOP/run -   2.56 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   40 runs - 25421.35 us/run -  60.13 GFLOP/run -   2.37 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24880.21 us/run -  60.13 GFLOP/run -   2.42 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   42 runs - 24610.71 us/run -  60.13 GFLOP/run -   2.44 TFLOPS
  MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  44 runs - 23247.95 us/run -  60.13 GFLOP/run -   2.59 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   40 runs - 25229.40 us/run -  60.13 GFLOP/run -   2.38 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   38 runs - 27730.39 us/run -  60.13 GFLOP/run -   2.17 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 28658.36 us/run -  60.13 GFLOP/run -   2.10 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   34 runs - 30667.38 us/run -  60.13 GFLOP/run -   1.96 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   36 runs - 29375.75 us/run -  60.13 GFLOP/run -   2.05 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                42 runs - 24167.88 us/run -  60.13 GFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 42 runs - 24162.33 us/run -  60.13 GFLOP/run -   2.49 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  36 runs - 28005.72 us/run -  60.13 GFLOP/run -   2.15 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                42 runs - 24828.86 us/run -  60.13 GFLOP/run -   2.42 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  44 runs - 23753.23 us/run -  60.13 GFLOP/run -   2.53 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  42 runs - 24208.48 us/run -  60.13 GFLOP/run -   2.48 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 42 runs - 24274.55 us/run -  60.13 GFLOP/run -   2.48 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  40 runs - 25486.95 us/run -  60.13 GFLOP/run -   2.36 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 38 runs - 27545.03 us/run -  60.13 GFLOP/run -   2.18 TFLOPS

0cc4m · 2025-09-19T06:23:10Z

@netrunnereve Thank you for the test. Yeah, I'm still torn on this. It's good for Intel and coopmat1, but slightly negative for other cases, very negative on Nvidia and on Apple. I'll put it back on draft for now and try to figure out whether there's a way to improve this.

I did not test it with RGP, so maybe it's not actually hitting the instructions. But I'd have to check with RADV, not sure how that works.

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

jeffbolznv · 2025-09-19T12:19:39Z

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

I didn't see anything obvious in the change so I looked at the sass we generate. I think the change is due to the use of the dot intrinsic, our compiler will expand it into componentwise mul+add and has trouble re-vectorizing that into paired fma instructions.

I think shaderFloat16 is not enabled on most pre-Turing devices (for performance reasons) so they would probably not be affected. But a few, like Tesla P100(?) presumably would.

0cc4m · 2025-09-19T13:04:56Z

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

I didn't see anything obvious in the change so I looked at the sass we generate. I think the change is due to the use of the dot intrinsic, our compiler will expand it into componentwise mul+add and has trouble re-vectorizing that into paired fma instructions.

That's odd, I used dot specifically to create a situation where there are two independent multiplies/adds that should be easy to fuse into one of the dot or fma instructions that many GPU architectures have. I have to look at AMD assembly to see if it worked there.

jeffbolznv · 2025-09-19T13:28:01Z

Yeah, this is our compiler not doing a great job. It might work better to do fmas on the vectors and then combine the two components at the end.

0cc4m · 2025-09-19T15:48:36Z

@jeffbolznv You are correct, thank you. This diff fixes the performance issue on Nvidia and Apple:

diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
index d22230fad..38a4d07d0 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
@@ -357,7 +357,7 @@ void main() {
                     [[unroll]] for (uint cc = 0; cc < TN; cc++) {
                         [[unroll]] for (uint cr = 0; cr < TM; cr++) {
                             const uint sums_idx = (wsic * TN + cc) * (WMITER * TM) + wsir * TM + cr;
-                            sums[sums_idx] += dot(ACC_TYPE_VEC2(cache_a[wsir * TM + cr]), ACC_TYPE_VEC2(cache_b[cc]));
+                            sums[sums_idx] = fma(ACC_TYPE(cache_a[wsir * TM + cr].x), ACC_TYPE(cache_b[cc].x), fma(ACC_TYPE(cache_a[wsir * TM + cr].y), ACC_TYPE(cache_b[cc].y), sums[sums_idx]));
                         }
                     }
                 }

0cc4m · 2025-09-19T16:39:04Z

Here are updated results:

Nvidia RTX 3090 without coopmat or integer dot

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	1127.73 ± 2.10	1244.38 ± 4.58	+10.3%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	1113.12 ± 2.21	1228.22 ± 3.10	+10.3%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	1059.88 ± 5.29	1183.09 ± 2.50	+11.6%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	1042.59 ± 3.66	1168.04 ± 2.60	+12.0%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	985.79 ± 3.83	1089.85 ± 4.18	+10.6%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	974.48 ± 1.13	1073.09 ± 7.68	+10.1%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	978.75 ± 3.61	1091.84 ± 5.65	+11.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	970.51 ± 0.57	1066.46 ± 5.65	+9.9%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	1131.58 ± 5.90	1215.66 ± 15.93	+7.4%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	1113.63 ± 1.68	1188.07 ± 10.87	+6.7%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	1100.51 ± 3.50	1209.61 ± 8.40	+9.9%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	1085.11 ± 2.51	1179.40 ± 5.16	+8.7%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	1092.64 ± 4.17	1205.93 ± 3.23	+10.4%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	1079.56 ± 2.01	1180.21 ± 11.35	+9.3%

AMD Radeon Pro VII without integer dot

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	327.01 ± 0.32	333.32 ± 1.92	+1.9%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	312.49 ± 0.24	317.96 ± 0.36	+1.8%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	314.93 ± 0.19	323.17 ± 0.87	+2.6%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	302.94 ± 0.09	310.66 ± 0.48	+2.5%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	303.29 ± 0.71	305.22 ± 0.72	+0.6%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	291.26 ± 0.11	293.29 ± 0.43	+0.7%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	293.15 ± 0.32	293.67 ± 1.06	+0.2%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	281.64 ± 0.20	283.24 ± 0.56	+0.6%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	338.81 ± 1.06	339.25 ± 1.85	+0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	323.43 ± 0.18	325.40 ± 0.65	+0.6%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	341.30 ± 0.24	340.93 ± 1.42	-0.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	326.30 ± 0.29	326.85 ± 0.86	+0.2%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	332.56 ± 0.26	335.34 ± 1.45	+0.8%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	317.07 ± 0.30	320.66 ± 0.77	+1.1%

AMD Radeon Pro VII without integer dot or fp16

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	454.79 ± 2.11	468.57 ± 1.22	+3.0%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	427.14 ± 1.50	439.25 ± 0.59	+2.8%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	410.62 ± 2.13	423.55 ± 1.34	+3.1%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	388.88 ± 0.76	401.95 ± 0.55	+3.4%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	417.58 ± 2.28	423.92 ± 0.52	+1.5%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	395.26 ± 1.12	401.45 ± 0.44	+1.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	397.12 ± 2.83	400.94 ± 1.53	+1.0%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	376.35 ± 0.71	381.93 ± 0.30	+1.5%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	471.54 ± 2.17	477.35 ± 0.45	+1.2%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	440.95 ± 1.48	448.43 ± 0.63	+1.7%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	471.33 ± 1.64	478.87 ± 0.47	+1.6%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	442.25 ± 1.09	450.15 ± 0.23	+1.8%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	455.63 ± 0.36	467.57 ± 0.85	+2.6%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	428.45 ± 0.27	439.68 ± 0.32	+2.6%

AMD Radeon RX 6800 XT without integer dot

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	935.47 ± 0.58	933.18 ± 0.23	-0.2%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	908.75 ± 0.53	906.94 ± 0.57	-0.2%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	915.66 ± 0.70	907.38 ± 0.93	-0.9%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	889.68 ± 0.22	883.55 ± 0.13	-0.7%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	799.45 ± 0.50	792.41 ± 0.35	-0.9%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	780.09 ± 0.14	774.30 ± 0.22	-0.7%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	761.47 ± 0.44	747.38 ± 0.28	-1.9%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	744.30 ± 0.25	731.15 ± 0.25	-1.8%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	951.67 ± 0.59	950.34 ± 0.55	-0.1%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	924.76 ± 0.40	924.53 ± 0.28	-0.0%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	966.97 ± 0.59	961.14 ± 0.37	-0.6%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	939.43 ± 0.68	934.73 ± 0.27	-0.5%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	930.58 ± 0.57	917.98 ± 0.62	-1.4%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	905.20 ± 0.37	893.14 ± 0.15	-1.3%

Intel A770 without integer dot

model	size	params	backend	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	0	pp512	102.81 ± 0.07	301.44 ± 0.85	+193.2%
llama 8B IQ1_S - 1.5625 bpw	1.87 GiB	8.03 B	Vulkan	99	1	pp512	78.92 ± 0.11	103.90 ± 0.04	+31.7%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	0	pp512	99.07 ± 0.08	234.28 ± 0.17	+136.5%
llama 8B IQ2_M - 2.7 bpw	2.74 GiB	8.03 B	Vulkan	99	1	pp512	76.43 ± 0.05	116.72 ± 0.04	+52.7%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	0	pp512	99.56 ± 0.12	234.16 ± 0.24	+135.2%
llama 8B IQ4_XS - 4.25 bpw	4.13 GiB	8.03 B	Vulkan	99	1	pp512	76.43 ± 0.05	107.81 ± 0.13	+41.1%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	0	pp512	98.93 ± 0.09	229.02 ± 0.51	+131.5%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	Vulkan	99	1	pp512	77.38 ± 0.04	105.77 ± 0.06	+36.7%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	0	pp512	104.65 ± 0.02	291.24 ± 0.37	+178.3%
llama 8B Q4_0	4.33 GiB	8.03 B	Vulkan	99	1	pp512	79.63 ± 0.04	121.13 ± 0.07	+52.1%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	0	pp512	105.08 ± 0.10	290.53 ± 0.29	+176.5%
llama 8B Q4_1	4.77 GiB	8.03 B	Vulkan	99	1	pp512	80.50 ± 0.11	121.54 ± 0.04	+51.0%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	0	pp512	99.65 ± 0.13	268.29 ± 0.56	+169.2%
llama 8B Q8_0	7.95 GiB	8.03 B	Vulkan	99	1	pp512	76.89 ± 0.13	118.52 ± 0.02	+54.1%

@netrunnereve I think you are right, I see lots of v_fma_f16 in the Vega20 disassembly, but no v_pk_fma_f16, only a few cases of v_pk_mul_f16. Same with RDNA2. Not sure how I would trigger that or why it isn't triggering.

netrunnereve · 2025-09-19T17:42:24Z

I did not test it with RGP, so maybe it's not actually hitting the instructions. But I'd have to check with RADV, not sure how that works.

I think you figured this out already but for radv you can use RADV_DEBUG=shaders to get the assembly. To ensure that you're getting the correct shader run it along with test backend ops on a single test case only.

@netrunnereve I think you are right, I see lots of v_fma_f16 in the Vega20 disassembly, but no v_pk_fma_f16, only a few cases of v_pk_mul_f16. Same with RDNA2. Not sure how I would trigger that or why it isn't triggering.

Even a packed multiply or fma isn't good as that only does the calculation in parallel for the upper and lower 16 bits, and you'll still need additional instructions to add them together and convert it to fp32. With V_DOT2_F32_F16 this does the entire dot product, fp32 conversion, and a free addition with the previous sum all in one cycle.

Honestly this is something worthy of a Mesa ticket (idc about amdvlk anymore as amd got rid of it), and I doubt they've put a lot of effort into this as graphics apps mostly use fp32. Since RGP uses amdvlk in the background it looks like AMD hasn't looked into this use case either. Maybe there should be a shaderfloatingpointdotproduct vulkan extension for this.

0cc4m · 2025-09-19T18:44:32Z

Yeah, I looked for dot first, but I couldn't trigger any dot function on either GPU. At least a few of the packed ones triggered, but fma would have been more interesting than a few quant-specific packed muls.

I can trigger dot2 in RDNA4 in amdvlk RGP with this source code:

#version 450

#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout (binding = 0) readonly buffer A {f16vec2 data_a[];};
layout (binding = 1) writeonly buffer B {float data_b[];};

void main()
{
    uint index = gl_GlobalInvocationID.x;

	const f16vec2 val0 = data_a[index * 2];
	const f16vec2 val1 = data_a[index * 2 + 1];

	data_b[index] = dot(val0, val1);
}

For RDNA 4 this gives me:

22	 0x00008C	     v_dot2_f16_f16   	 v1.l,  v2,  v1,  0	 3,12	 
23	 0x000094	     v_cvt_f32_f16_e32	 v1,  v1.l         	 2,12

but for older generations at most a packed mul:

18	 0x000068	     v_pk_mul_f16     	 v1,  v1,  v2                                                                   	 3,8	 
19	 0x000070	     v_add_f16_sdwa   	 v1,  v1,  v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1	 2,8	 
20	 0x000078	     v_cvt_f32_f16_e32	 v1,  v1                                                                        	 2,8

With RADV, for Vega20 I just get this:

        v_mul_f16_sdwa v1, s2, v0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:DWORD ; 440200f9 06851402
        v_fma_f16 v1, s2, v2, v1                                    ; d2060001 04060402
        v_cvt_f32_f16_e32 v3, v1                                    ; 7e061701

Yeah, I think we should open an issue with Mesa. I don't think an extension would make it through, since it's basically just up to the compiler to spot the opportunity to use it.

jeffbolznv

I verified the latest commit is generating good code.

netrunnereve

For RDNA 4 this gives me:

And here they waste a clock cycle for the "conversion" when there's an instruction that already does that. Also v_dot2_f16_f16 is not a RDNA4 exclusive so it looks like there's some different compiler logic for that chip.

Yeah, I think we should open an issue with Mesa. I don't think an extension would make it through, since it's basically just up to the compiler to spot the opportunity to use it.

I mean the spirv literally contains a dot instruction and the compiler should recognise that, hey, this is a fp16 dot product and we have an instruction for it! The only thing an extension would do is force the use of the instruction and also indicate that the GPU has special hardware support for dot products.

I've been thinking of getting a MI25 or MI50 once I figure out if it's possible to cut a hole in the cover and place a fan inside. Adding those mini fans to the front is going to be loud and my computer doesn't have the space for it. If I manage to get that going I'll probably start looking into how Mesa handles the dot products internally.

Anyways I think this is good to merge now.

wbruna · 2025-09-21T12:25:13Z

This change seems to have broken prompt adherence on stable-diffusion.cpp: leejet/stable-diffusion.cpp#847 .

…rg#16056)" This reverts commit 803dac2.

…ggml-org#16056)"" This reverts commit 3170fc3.

etasnadi · 2025-09-22T11:36:43Z

sass

@jeffbolznv Do you know why this is very bad on Ampere (non-coopmat) and whether this likely affects pre-Turing the same way?

I didn't see anything obvious in the change so I looked at the sass we generate. I think the change is due to the use of the dot intrinsic, our compiler will expand it into componentwise mul+add and has trouble re-vectorizing that into paired fma instructions.

I think shaderFloat16 is not enabled on most pre-Turing devices (for performance reasons) so they would probably not be affected. But a few, like Tesla P100(?) presumably would.

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

0cc4m · 2025-09-22T11:46:40Z

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

etasnadi · 2025-09-22T11:53:31Z

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

Sure, I am interested not only because it would make profiling easier but it would be also nice if we could inject any sass code into the shader compiled with nvcc.

0cc4m · 2025-09-22T12:05:41Z

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

Sure, I am interested not only because it would make profiling easier but it would be also nice if we could inject any sass code into the shader compiled with nvcc.

I've asked this as well: #15363 (comment)

etasnadi · 2025-09-22T12:59:10Z

How do you obtain the sass code? Is there any documented way of doing that? I am only aware of processing the files in ~/.nv/GLCache using 3rd party tools like https://github.com/therontarigo/nvcachetools.

The trick is working for Nvidia. 😄

Sure, I am interested not only because it would make profiling easier but it would be also nice if we could inject any sass code into the shader compiled with nvcc.

I've asked this as well: #15363 (comment)

Thanks, I didn't know that you can profile compute shaders with Nsight on instruction level (e.g bank conflicts, register usage, etc.).

* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions * use fma instead of dot to fix Nvidia and Apple performance issues

netrunnereve · 2025-09-28T02:29:02Z

FYI I took a quick look on the ACO side and turns out they don't have special support for FP16 dot products. They do have packed fmas and multiplies though, and surprisingly they have support for BF16 dot products for RDNA 3 and 4. Of course there's no point in doing that in our case as those chips have coopmat.

There's actually a Mesa PR for this but it got closed.

0cc4m · 2025-09-29T06:08:50Z

Thank you for checking. Can you open a RADV issue for it? They closed that PR because it didn't bring benefits for games, but I think for us it would be helpful.

netrunnereve · 2025-09-29T15:52:24Z

Sure if you provide the spirv and Mesa assembly dump for your card. Please generate that with the dot() function as spirv has an instruction for that and that's what the compiler will be looking for, I don't think it can recognise the fmas you're using now.

Otherwise I'll deal with this once I get my MI card 😁.

0cc4m · 2025-10-02T06:53:03Z

I talked briefly with the author of that Mesa PR, and basically there are precision issues with the dot2 instructions and different unexpected behaviour across the AMD generations implementing it, that's why they don't want to use them.

vulkan: Change the mul_mm shared memory and register caching system t…

be5a0a8

…o use vec2 instead of scalars, to enable using dot2 instructions

0cc4m requested a review from jeffbolznv September 17, 2025 16:34

github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Sep 17, 2025

jeffbolznv approved these changes Sep 18, 2025

View reviewed changes

0cc4m marked this pull request as draft September 19, 2025 06:24

use fma instead of dot to fix Nvidia and Apple performance issues

e604b63

0cc4m marked this pull request as ready for review September 19, 2025 18:44

0cc4m requested a review from jeffbolznv September 19, 2025 18:45

jeffbolznv approved these changes Sep 19, 2025

View reviewed changes

netrunnereve approved these changes Sep 19, 2025

View reviewed changes

0cc4m merged commit 803dac2 into master Sep 20, 2025
92 of 94 checks passed

0cc4m deleted the 0cc4m/vulkan-mm-use-vec-dot branch September 20, 2025 08:43

wbruna mentioned this pull request Sep 21, 2025

ggml 8de5c82360 broke image generation on Vulkan leejet/stable-diffusion.cpp#847

Closed

LostRuins mentioned this pull request Sep 21, 2025

update stable-diffusion.cpp to master-306-2abe945 LostRuins/koboldcpp#1732

Merged

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Sep 21, 2025

Revert "vulkan: use vec dot for matrix matrix multiplications (ggml-o…

3170fc3

…rg#16056)" This reverts commit 803dac2.

0cc4m mentioned this pull request Sep 21, 2025

vulkan: vec dot matrix multiplication fix #16151

Merged

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Sep 22, 2025

Revert "Revert "vulkan: use vec dot for matrix matrix multiplications (…

b5931c9

…ggml-org#16056)"" This reverts commit 3170fc3.

0cc4m mentioned this pull request Oct 2, 2025

Misc. bug: Critical Crash and Performance Degradation in Vulkan Build since Release B6524 #16301

Open

vulkan: use vec dot for matrix matrix multiplications #16056

vulkan: use vec dot for matrix matrix multiplications #16056

Conversation

0cc4m commented Sep 17, 2025

Uh oh!

0cc4m commented Sep 17, 2025

Coopmat1:

fp16 only, no coopmat or integer dot:

without integer dot:

fp32 only:

without integer dot:

Uh oh!

netrunnereve commented Sep 18, 2025

Uh oh!

0cc4m commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffbolznv commented Sep 19, 2025

Uh oh!

0cc4m commented Sep 19, 2025

Uh oh!

jeffbolznv commented Sep 19, 2025

Uh oh!

0cc4m commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

netrunnereve commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented Sep 19, 2025

Uh oh!

jeffbolznv left a comment

Choose a reason for hiding this comment

Uh oh!

netrunnereve left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wbruna commented Sep 21, 2025

Uh oh!

etasnadi commented Sep 22, 2025

Uh oh!

0cc4m commented Sep 22, 2025

Uh oh!

etasnadi commented Sep 22, 2025

Uh oh!

0cc4m commented Sep 22, 2025

Uh oh!

etasnadi commented Sep 22, 2025

Uh oh!

netrunnereve commented Sep 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented Sep 29, 2025

Uh oh!

netrunnereve commented Sep 29, 2025

Uh oh!

0cc4m commented Oct 2, 2025

Uh oh!

Uh oh!

0cc4m commented Sep 19, 2025 •

edited

Loading

0cc4m commented Sep 19, 2025 •

edited

Loading

0cc4m commented Sep 19, 2025 •

edited

Loading

netrunnereve commented Sep 19, 2025 •

edited

Loading

netrunnereve commented Sep 28, 2025 •

edited

Loading