CUDA: refactor and deduplicate vector FA kernels #16208

JohannesGaessler · 2025-09-23T22:26:46Z

This PR:

Refactors and deduplicates the CUDA vector FlashAttention kernels. As with the mma and tile kernels, KQ accumulation and softmax are always done in FP32; if fast FP16 is available, it is used for the KQ dot products as well as for the VKQ dot products and accumulation.
Decouples the number of threads from the head size. This enables KV cache quantization for head sizes 64 and 256; previously only head size 128 was properly supported.
Refactors the memory layout to use larger copies, eliminate inter-warp dependencies during the main loop, and reduce both intra-warp dependencies and shared memory I/O.
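For readers unfamiliar with what a vector FlashAttention kernel computes, here is a minimal sketch of single-query attention with a running maximum and running sum (online softmax). It is not the kernel code from this PR: the function names and the plain-Python data layout are assumptions for illustration, and Python floats are double precision, so the FP16/FP32 split described above is not modeled.

```python
import math

def attention_naive(q, K, V, scale):
    """Reference: materialize all KQ scores, then apply softmax to V."""
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(w[j] * V[j][i] for j in range(len(V))) / total
            for i in range(len(V[0]))]

def attention_streaming(q, K, V, scale):
    """Single pass over the KV cache; no score vector is stored."""
    kq_max = -math.inf          # running maximum of the KQ scores
    kq_sum = 0.0                # running softmax denominator
    vkq = [0.0] * len(V[0])     # running, unnormalized output accumulator
    for k, v in zip(K, V):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(kq_max, s)
        # Rescale what was accumulated so far to the new maximum.
        # On the first iteration this is exp(-inf) == 0.0.
        rescale = math.exp(kq_max - new_max)
        p = math.exp(s - new_max)
        kq_sum = kq_sum * rescale + p
        vkq = [acc * rescale + p * vi for acc, vi in zip(vkq, v)]
        kq_max = new_max
    return [acc / kq_sum for acc in vkq]
```

Both functions return the same result up to rounding; the streaming form is the one that maps onto a loop over the KV cache, which is why the precision of the running accumulators matters.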

Performance changes

GPU	Model	KV type	Microbatch size	Test	t/s master	t/s `cf0d098`	Speedup
MI50	gemma 2B Q4_0	f16	1	pp16384	107.96	165.65	1.53
MI50	gemma 2B Q4_0	f16	2	pp16384	206.73	175.60	0.85
MI50	gemma 2B Q4_0	f16	4	pp16384	210.30	189.33	0.90
MI50	gemma 2B Q4_0	f16	8	pp16384	180.28	201.53	1.12
MI50	internlm2 ?B Q4_0	f16	1	pp16384	73.88	158.28	2.14
MI50	internlm2 ?B Q4_0	f16	2	pp16384	147.63	254.77	1.73
MI50	internlm2 ?B Q4_0	f16	4	pp16384	166.44	239.66	1.44
MI50	internlm2 ?B Q4_0	f16	8	pp16384	145.51	266.28	1.83
MI50	internlm2 ?B Q4_0	q4_0	1	pp16384	78.64	149.69	1.90
MI50	internlm2 ?B Q4_0	q4_0	2	pp16384	135.53	270.69	2.00
MI50	internlm2 ?B Q4_0	q4_0	4	pp16384	183.33	285.20	1.56
MI50	internlm2 ?B Q4_0	q4_0	8	pp16384	160.97	373.99	2.32
MI50	internlm2 ?B Q4_0	q4_1	1	pp16384	81.87	155.65	1.90
MI50	internlm2 ?B Q4_0	q4_1	2	pp16384	148.07	282.66	1.91
MI50	internlm2 ?B Q4_0	q4_1	4	pp16384	201.00	294.38	1.46
MI50	internlm2 ?B Q4_0	q4_1	8	pp16384	164.88	390.23	2.37
MI50	internlm2 ?B Q4_0	q5_0	1	pp16384	80.85	134.30	1.66
MI50	internlm2 ?B Q4_0	q5_0	2	pp16384	143.14	242.74	1.70
MI50	internlm2 ?B Q4_0	q5_0	4	pp16384	78.06	258.43	3.31
MI50	internlm2 ?B Q4_0	q5_0	8	pp16384	166.73	334.86	2.01
MI50	internlm2 ?B Q4_0	q5_1	1	pp16384	88.48	142.75	1.61
MI50	internlm2 ?B Q4_0	q5_1	2	pp16384	146.85	247.10	1.68
MI50	internlm2 ?B Q4_0	q5_1	4	pp16384	89.69	262.63	2.93
MI50	internlm2 ?B Q4_0	q5_1	8	pp16384	168.93	339.54	2.01
MI50	internlm2 ?B Q4_0	q8_0	1	pp16384	80.74	147.97	1.83
MI50	internlm2 ?B Q4_0	q8_0	2	pp16384	141.34	271.01	1.92
MI50	internlm2 ?B Q4_0	q8_0	4	pp16384	180.74	288.18	1.59
MI50	internlm2 ?B Q4_0	q8_0	8	pp16384	168.19	381.03	2.27
MI50	llama 1B Q4_0	f16	1	pp16384	177.31	176.48	1.00
MI50	llama 1B Q4_0	f16	2	pp16384	345.88	343.80	0.99
MI50	llama 1B Q4_0	f16	4	pp16384	468.29	469.14	1.00
MI50	llama 1B Q4_0	f16	8	pp16384	768.16	768.06	1.00
P40	gemma 2B Q4_0	f16	1	pp16384	120.01	136.15	1.13
P40	gemma 2B Q4_0	f16	2	pp16384	201.83	258.05	1.28
P40	gemma 2B Q4_0	f16	4	pp16384	238.96	321.69	1.35
P40	gemma 2B Q4_0	f16	8	pp16384	309.12	452.65	1.46
P40	internlm2 ?B Q4_0	f16	1	pp16384	114.38	116.74	1.02
P40	internlm2 ?B Q4_0	f16	2	pp16384	193.87	220.39	1.14
P40	internlm2 ?B Q4_0	f16	4	pp16384	232.12	318.61	1.37
P40	internlm2 ?B Q4_0	f16	8	pp16384	288.50	440.53	1.53
P40	internlm2 ?B Q4_0	q4_0	1	pp16384	100.44	121.99	1.21
P40	internlm2 ?B Q4_0	q4_0	2	pp16384	149.51	157.62	1.05
P40	internlm2 ?B Q4_0	q4_0	4	pp16384	171.23	187.23	1.09
P40	internlm2 ?B Q4_0	q4_0	8	pp16384	210.05	228.61	1.09
P40	internlm2 ?B Q4_0	q4_1	1	pp16384	101.70	128.41	1.26
P40	internlm2 ?B Q4_0	q4_1	2	pp16384	151.12	198.64	1.31
P40	internlm2 ?B Q4_0	q4_1	4	pp16384	172.61	240.67	1.39
P40	internlm2 ?B Q4_0	q4_1	8	pp16384	213.89	303.81	1.42
P40	internlm2 ?B Q4_0	q5_0	1	pp16384	60.55	84.31	1.39
P40	internlm2 ?B Q4_0	q5_0	2	pp16384	76.54	121.79	1.59
P40	internlm2 ?B Q4_0	q5_0	4	pp16384	81.46	138.27	1.70
P40	internlm2 ?B Q4_0	q5_0	8	pp16384	87.06	158.48	1.82
P40	internlm2 ?B Q4_0	q5_1	1	pp16384	62.99	101.47	1.61
P40	internlm2 ?B Q4_0	q5_1	2	pp16384	80.14	131.14	1.64
P40	internlm2 ?B Q4_0	q5_1	4	pp16384	85.35	148.91	1.74
P40	internlm2 ?B Q4_0	q5_1	8	pp16384	91.33	170.49	1.87
P40	internlm2 ?B Q4_0	q8_0	1	pp16384	102.09	124.77	1.22
P40	internlm2 ?B Q4_0	q8_0	2	pp16384	155.98	209.34	1.34
P40	internlm2 ?B Q4_0	q8_0	4	pp16384	179.36	261.06	1.46
P40	internlm2 ?B Q4_0	q8_0	8	pp16384	213.55	332.76	1.56
P40	llama 1B Q4_0	f16	1	pp16384	178.69	214.22	1.20
P40	llama 1B Q4_0	f16	2	pp16384	275.41	385.27	1.40
P40	llama 1B Q4_0	f16	4	pp16384	308.42	472.74	1.53
P40	llama 1B Q4_0	f16	8	pp16384	369.76	643.42	1.74
RTX 3090	gemma 2B Q4_0	f16	1	pp16384	337.91	336.99	1.00
RTX 3090	gemma 2B Q4_0	f16	2	pp16384	596.45	592.53	0.99
RTX 3090	gemma 2B Q4_0	f16	4	pp16384	1030.47	1032.56	1.00
RTX 3090	gemma 2B Q4_0	f16	8	pp16384	1352.03	1365.43	1.01
RTX 3090	internlm2 ?B Q4_0	f16	1	pp16384	297.98	298.00	1.00
RTX 3090	internlm2 ?B Q4_0	f16	2	pp16384	524.53	522.27	1.00
RTX 3090	internlm2 ?B Q4_0	f16	4	pp16384	937.79	936.12	1.00
RTX 3090	internlm2 ?B Q4_0	f16	8	pp16384	1412.27	1407.37	1.00
RTX 3090	internlm2 ?B Q4_0	q4_0	1	pp16384	320.62	320.41	1.00
RTX 3090	internlm2 ?B Q4_0	q4_0	2	pp16384	372.32	373.82	1.00
RTX 3090	internlm2 ?B Q4_0	q4_0	4	pp16384	679.37	679.34	1.00
RTX 3090	internlm2 ?B Q4_0	q4_0	8	pp16384	1091.64	1095.65	1.00
RTX 3090	internlm2 ?B Q4_0	q4_1	1	pp16384	342.22	342.24	1.00
RTX 3090	internlm2 ?B Q4_0	q4_1	2	pp16384	371.38	371.05	1.00
RTX 3090	internlm2 ?B Q4_0	q4_1	4	pp16384	674.19	676.69	1.00
RTX 3090	internlm2 ?B Q4_0	q4_1	8	pp16384	1081.41	1080.61	1.00
RTX 3090	internlm2 ?B Q4_0	q5_0	1	pp16384	284.63	285.62	1.00
RTX 3090	internlm2 ?B Q4_0	q5_0	2	pp16384	336.81	335.56	1.00
RTX 3090	internlm2 ?B Q4_0	q5_0	4	pp16384	614.13	613.81	1.00
RTX 3090	internlm2 ?B Q4_0	q5_0	8	pp16384	985.09	981.96	1.00
RTX 3090	internlm2 ?B Q4_0	q5_1	1	pp16384	320.54	319.61	1.00
RTX 3090	internlm2 ?B Q4_0	q5_1	2	pp16384	341.00	341.36	1.00
RTX 3090	internlm2 ?B Q4_0	q5_1	4	pp16384	621.39	618.65	1.00
RTX 3090	internlm2 ?B Q4_0	q5_1	8	pp16384	1005.26	1005.62	1.00
RTX 3090	internlm2 ?B Q4_0	q8_0	1	pp16384	323.34	324.65	1.00
RTX 3090	internlm2 ?B Q4_0	q8_0	2	pp16384	362.84	363.28	1.00
RTX 3090	internlm2 ?B Q4_0	q8_0	4	pp16384	660.38	659.74	1.00
RTX 3090	internlm2 ?B Q4_0	q8_0	8	pp16384	1044.19	1047.50	1.00
RTX 3090	llama 1B Q4_0	f16	1	pp16384	499.63	501.04	1.00
RTX 3090	llama 1B Q4_0	f16	2	pp16384	864.06	863.00	1.00
RTX 3090	llama 1B Q4_0	f16	4	pp16384	1502.10	1504.38	1.00
RTX 3090	llama 1B Q4_0	f16	8	pp16384	2312.14	2304.54	1.00
RTX 4090	gemma 2B Q4_0	f16	1	pp16384	442.21	443.69	1.00
RTX 4090	gemma 2B Q4_0	f16	2	pp16384	723.16	726.60	1.00
RTX 4090	gemma 2B Q4_0	f16	4	pp16384	1384.84	1382.12	1.00
RTX 4090	gemma 2B Q4_0	f16	8	pp16384	1998.59	1999.49	1.00
RTX 4090	internlm2 ?B Q4_0	f16	1	pp16384	373.72	374.73	1.00
RTX 4090	internlm2 ?B Q4_0	f16	2	pp16384	622.28	622.31	1.00
RTX 4090	internlm2 ?B Q4_0	f16	4	pp16384	1227.15	1231.67	1.00
RTX 4090	internlm2 ?B Q4_0	f16	8	pp16384	1983.17	1987.15	1.00
RTX 4090	internlm2 ?B Q4_0	q4_0	1	pp16384	429.58	449.08	1.05
RTX 4090	internlm2 ?B Q4_0	q4_0	2	pp16384	549.58	718.78	1.31
RTX 4090	internlm2 ?B Q4_0	q4_0	4	pp16384	1079.55	1081.15	1.00
RTX 4090	internlm2 ?B Q4_0	q4_0	8	pp16384	1764.52	1760.72	1.00
RTX 4090	internlm2 ?B Q4_0	q4_1	1	pp16384	429.71	454.61	1.06
RTX 4090	internlm2 ?B Q4_0	q4_1	2	pp16384	547.64	727.61	1.33
RTX 4090	internlm2 ?B Q4_0	q4_1	4	pp16384	1070.18	1071.83	1.00
RTX 4090	internlm2 ?B Q4_0	q4_1	8	pp16384	1745.26	1746.25	1.00
RTX 4090	internlm2 ?B Q4_0	q5_0	1	pp16384	384.70	416.84	1.08
RTX 4090	internlm2 ?B Q4_0	q5_0	2	pp16384	516.37	680.59	1.32
RTX 4090	internlm2 ?B Q4_0	q5_0	4	pp16384	1001.22	1007.09	1.01
RTX 4090	internlm2 ?B Q4_0	q5_0	8	pp16384	1632.30	1629.61	1.00
RTX 4090	internlm2 ?B Q4_0	q5_1	1	pp16384	404.47	436.09	1.08
RTX 4090	internlm2 ?B Q4_0	q5_1	2	pp16384	522.52	705.03	1.35
RTX 4090	internlm2 ?B Q4_0	q5_1	4	pp16384	1011.96	1014.96	1.00
RTX 4090	internlm2 ?B Q4_0	q5_1	8	pp16384	1655.43	1660.77	1.00
RTX 4090	internlm2 ?B Q4_0	q8_0	1	pp16384	412.88	423.43	1.03
RTX 4090	internlm2 ?B Q4_0	q8_0	2	pp16384	530.25	690.37	1.30
RTX 4090	internlm2 ?B Q4_0	q8_0	4	pp16384	1028.20	1029.95	1.00
RTX 4090	internlm2 ?B Q4_0	q8_0	8	pp16384	1657.93	1654.93	1.00
RTX 4090	llama 1B Q4_0	f16	1	pp16384	648.76	668.58	1.03
RTX 4090	llama 1B Q4_0	f16	2	pp16384	1036.57	1044.80	1.01
RTX 4090	llama 1B Q4_0	f16	4	pp16384	2035.72	2024.55	0.99
RTX 4090	llama 1B Q4_0	f16	8	pp16384	3238.55	3234.24	1.00
RX 6800	gemma 2B Q4_0	f16	1	pp16384	109.48	129.81	1.19
RX 6800	gemma 2B Q4_0	f16	2	pp16384	119.82	143.82	1.20
RX 6800	gemma 2B Q4_0	f16	4	pp16384	158.89	168.73	1.06
RX 6800	gemma 2B Q4_0	f16	8	pp16384	176.88	191.15	1.08
RX 6800	internlm2 ?B Q4_0	f16	1	pp16384	58.79	112.01	1.91
RX 6800	internlm2 ?B Q4_0	f16	2	pp16384	88.66	171.27	1.93
RX 6800	internlm2 ?B Q4_0	f16	4	pp16384	96.47	281.93	2.92
RX 6800	internlm2 ?B Q4_0	f16	8	pp16384	98.63	321.10	3.26
RX 6800	internlm2 ?B Q4_0	q4_0	1	pp16384	62.18	111.82	1.80
RX 6800	internlm2 ?B Q4_0	q4_0	2	pp16384	111.85	213.92	1.91
RX 6800	internlm2 ?B Q4_0	q4_0	4	pp16384	88.88	324.86	3.65
RX 6800	internlm2 ?B Q4_0	q4_0	8	pp16384	104.07	411.65	3.96
RX 6800	internlm2 ?B Q4_0	q4_1	1	pp16384	62.67	113.31	1.81
RX 6800	internlm2 ?B Q4_0	q4_1	2	pp16384	113.07	221.49	1.96
RX 6800	internlm2 ?B Q4_0	q4_1	4	pp16384	93.61	341.20	3.64
RX 6800	internlm2 ?B Q4_0	q4_1	8	pp16384	103.30	431.99	4.18
RX 6800	internlm2 ?B Q4_0	q5_0	1	pp16384	52.29	84.82	1.62
RX 6800	internlm2 ?B Q4_0	q5_0	2	pp16384	73.98	151.41	2.05
RX 6800	internlm2 ?B Q4_0	q5_0	4	pp16384	93.85	212.44	2.26
RX 6800	internlm2 ?B Q4_0	q5_0	8	pp16384	99.53	262.80	2.64
RX 6800	internlm2 ?B Q4_0	q5_1	1	pp16384	40.27	85.17	2.11
RX 6800	internlm2 ?B Q4_0	q5_1	2	pp16384	102.04	158.21	1.55
RX 6800	internlm2 ?B Q4_0	q5_1	4	pp16384	114.37	226.57	1.98
RX 6800	internlm2 ?B Q4_0	q5_1	8	pp16384	101.26	282.50	2.79
RX 6800	internlm2 ?B Q4_0	q8_0	1	pp16384	58.37	106.93	1.83
RX 6800	internlm2 ?B Q4_0	q8_0	2	pp16384	115.62	216.44	1.87
RX 6800	internlm2 ?B Q4_0	q8_0	4	pp16384	95.50	330.94	3.47
RX 6800	internlm2 ?B Q4_0	q8_0	8	pp16384	104.95	422.05	4.02
RX 6800	llama 1B Q4_0	f16	1	pp16384	107.18	181.79	1.70
RX 6800	llama 1B Q4_0	f16	2	pp16384	137.38	315.16	2.29
RX 6800	llama 1B Q4_0	f16	4	pp16384	121.12	424.00	3.50
RX 6800	llama 1B Q4_0	f16	8	pp16384	115.33	561.43	4.87

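Summary: a small performance boost on modern NVIDIA GPUs for quantized KV cache at batch sizes 1-2, a moderate boost on older NVIDIA GPUs at batch sizes 1-8, and a large boost on AMD GPUs at batch sizes 1-8. The exception in the table is gemma 2B on MI50, which regresses at microbatch sizes 2 and 4 (0.85 and 0.90).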
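The Speedup column appears to be the ratio of the two throughput columns, rounded to two decimals. The following sketch recomputes it for three rows copied from the table; the helper name `speedup` is an assumption for illustration and is not part of any benchmark tool.

```python
def speedup(ts_master, ts_pr):
    """Ratio of PR throughput to master throughput, rounded like the table."""
    return round(ts_pr / ts_master, 2)

# (GPU, model, microbatch size, t/s master, t/s PR), copied from the table
rows = [
    ("MI50", "gemma 2B Q4_0", 1, 107.96, 165.65),
    ("MI50", "gemma 2B Q4_0", 2, 206.73, 175.60),
    ("RX 6800", "llama 1B Q4_0", 8, 115.33, 561.43),
]

for gpu, model, ubatch, master, pr in rows:
    print(f"{gpu}\t{model}\t{ubatch}\t{speedup(master, pr):.2f}")
```

Values below 1.00 indicate a regression relative to master.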

JohannesGaessler · 2025-09-23T22:31:52Z

Context for the table: LLaMA 3.2 has a head size of 64, InternLM 2 has a head size of 128, and Gemma has a head size of 256. I chose these models because I needed models that cover these head sizes and are small enough that a benchmark run takes ~1 hour at most.

Also after this PR has been merged it would be fine to pad the KV cache to only multiples of 128 rather than 256.

JohannesGaessler · 2025-09-24T09:56:18Z

> Also after this PR has been merged it would be fine to pad the KV cache to only multiples of 128 rather than 256.

Actually, there is still an issue with the WMMA kernel (used for Volta and rocWMMA) assuming a padding of 256 but that issue should be resolvable with manageable effort. Long-term I want to completely replace the WMMA kernel with the mma kernel.
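To make the effect of the padding granularity concrete, here is a small sketch that rounds a KV cache length up to a multiple of 256 versus 128. The helper `pad_to_multiple` and the example length are hypothetical and are not taken from the llama.cpp code.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the next multiple of `multiple` (n >= 0, multiple > 0)."""
    return ((n + multiple - 1) // multiple) * multiple

n_kv = 1100  # hypothetical number of used KV cache cells
print(pad_to_multiple(n_kv, 256))  # 1280
print(pad_to_multiple(n_kv, 128))  # 1152
```

With the finer granularity, fewer unused cells are processed whenever the length does not already fall on a multiple of 256.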

CUDA: refactor and deduplicate vector FA kernels

e267903

JohannesGaessler requested a review from slaren as a code owner September 23, 2025 22:26

github-actions bot added the Nvidia GPU, python, and ggml labels Sep 23, 2025

fix kernel selection logic

8ba0ff7

JohannesGaessler force-pushed the cuda-fa-vec-128-4 branch from c7ec17f to 8ba0ff7 Compare September 24, 2025 19:21

JohannesGaessler mentioned this pull request Sep 27, 2025

hip : substituted bpermute ops with swizzle ops (gfx906, maybe all AMD) #16291

Open

slaren approved these changes Sep 27, 2025

JohannesGaessler merged commit 75a3a6c into ggml-org:master Sep 27, 2025
63 of 67 checks passed

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 27, 2025

Revert "CUDA: refactor and deduplicate vector FA kernels (ggml-org#16208)" (9994db4)

This reverts commit 75a3a6c.

JohannesGaessler mentioned this pull request Sep 28, 2025

ggml : add support for non-padded FA KV #16148

Open

Nexesenex added further commits to Nexesenex/croco.cpp that referenced this pull request, each reverting commit 75a3a6c:

40fb982 (Sep 29, 2025)
822248c (Sep 30, 2025)
df337c9 (Oct 1, 2025)
3be9cf1 (Oct 2, 2025)
2916a09 (Oct 2, 2025)
