vulkan: use vec dot for matrix matrix multiplications #16056
Conversation
vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
Nvidia RTX 3090, coopmat1: (results table omitted)
Nvidia RTX 3090, fp16 only, no coopmat or integer dot: (results table omitted)
AMD Radeon Pro VII, without integer dot: (results table omitted)
AMD Radeon Pro VII, fp32 only: (results table omitted)
AMD Radeon RX 6800 XT, without integer dot: (results table omitted)
Intel A770: (results table omitted)

In summary: positive for Nvidia coopmat1, slightly positive for AMD Vega20, slightly negative for AMD RDNA2, and very positive for Intel Alchemist.
Here's a quick test on my 470; it runs slightly faster or slower depending on the quant used. From looking at the changes it shouldn't make a difference, but I guess the driver is compiling things a bit differently. As for your results, the only numbers that really matter are the non-dp4a p512 ones. Something is really, really wrong with those Nvidia runs, and I have a feeling this might have terrible results on a Maxwell or Pascal chip. Meanwhile for AMD, did you check whether you're using the V_DOT2_F32_F16 instruction? At least with RGP I've never been able to get it to generate that, even if I use […]. Intel's good, so that's good, and it looks like it's now using the dot product instructions.

PR: (table omitted)
Master: (table omitted)
@netrunnereve Thank you for the test. Yeah, I'm still torn on this. It's good for Intel and coopmat1, but slightly negative for other cases, and very negative on Nvidia and on Apple. I'll put it back in draft for now and try to figure out whether there's a way to improve this. I did not test it with RGP, so maybe it's not actually hitting the instructions; I'd have to check with RADV, and I'm not sure how that works. @jeffbolznv Do you know why this is very bad on Ampere (non-coopmat), and whether this likely affects pre-Turing the same way?
I didn't see anything obvious in the change, so I looked at the SASS we generate. I think the slowdown is due to the use of the […]. I think shaderFloat16 is not enabled on most pre-Turing devices (for performance reasons), so they would probably not be affected. But a few, like the Tesla P100(?), presumably would.
That's odd. I used dot specifically to create a situation where there are two independent multiplies/adds that should be easy to fuse into one of the dot or fma instructions that many GPU architectures have. I'll have to look at AMD assembly to see if it worked there.
Yeah, this is our compiler not doing a great job. It might work better to do fmas on the vectors and then combine the two components at the end.
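To make the suggested strategy concrete, here is a hypothetical Python sketch (not the actual shader, and the function and variable names are illustrative): keep a two-component accumulator, do a per-lane multiply-add each iteration, and combine the two lanes only once at the end.

```python
def mul_mm_row(cache_a, cache_b):
    """Accumulate a dot product over vec2-packed values.

    cache_a, cache_b: lists of (x, y) pairs, standing in for the
    vec2 register caches in mul_mm.comp.
    """
    acc = [0.0, 0.0]  # two independent lanes, like a vec2 accumulator
    for (ax, ay), (bx, by) in zip(cache_a, cache_b):
        # per-lane multiply-add; on a GPU each of these can map to one fma
        acc[0] = ax * bx + acc[0]
        acc[1] = ay * by + acc[1]
    # horizontal add: combine the two lanes only once, at the very end
    return acc[0] + acc[1]
```

The point of the shape is that the inner loop contains only independent fmas, so the compiler never has to spot a fusion opportunity across a `dot` call and a separate `+=`.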
@jeffbolznv You are correct, thank you. This diff fixes the performance issue on Nvidia and Apple:

```diff
diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
index d22230fad..38a4d07d0 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mm.comp
@@ -357,7 +357,7 @@ void main() {
             [[unroll]] for (uint cc = 0; cc < TN; cc++) {
                 [[unroll]] for (uint cr = 0; cr < TM; cr++) {
                     const uint sums_idx = (wsic * TN + cc) * (WMITER * TM) + wsir * TM + cr;
-                    sums[sums_idx] += dot(ACC_TYPE_VEC2(cache_a[wsir * TM + cr]), ACC_TYPE_VEC2(cache_b[cc]));
+                    sums[sums_idx] = fma(ACC_TYPE(cache_a[wsir * TM + cr].x), ACC_TYPE(cache_b[cc].x), fma(ACC_TYPE(cache_a[wsir * TM + cr].y), ACC_TYPE(cache_b[cc].y), sums[sums_idx]));
                 }
             }
```
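For illustration, a small Python model (not the shader itself; `fma(a, b, c)` is modelled as plain `a * b + c`, whereas real hardware fuses it) of the two accumulation forms in that diff. Both compute the same value; the fma form simply hands the compiler explicit multiply-adds instead of asking it to fuse a `dot` with a separate `+=`.

```python
def dot_form(a, b, acc):
    # old form: sums[idx] += dot(vec2(a), vec2(b))
    return acc + (a[0] * b[0] + a[1] * b[1])

def fma_form(a, b, acc):
    # new form: sums[idx] = fma(a.x, b.x, fma(a.y, b.y, sums[idx]))
    # fma modelled here as multiply-then-add
    return a[0] * b[0] + (a[1] * b[1] + acc)
```

For these exact small inputs the two agree bit-for-bit; in fp16/fp32 shader arithmetic they can differ in rounding, which is acceptable for these accumulations.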
Here are updated results (tables omitted):
Nvidia RTX 3090, without coopmat or integer dot
AMD Radeon Pro VII, without integer dot
AMD Radeon Pro VII, without integer dot or fp16
AMD Radeon RX 6800 XT, without integer dot
Intel A770, without integer dot
@netrunnereve I think you are right, I see lots of […]
I think you figured this out already, but for RADV you can use […]
Even a packed multiply or fma isn't good, as it only does the calculation in parallel for the upper and lower 16 bits, and you'll still need additional instructions to add them together and convert the result to fp32. With V_DOT2_F32_F16 the entire dot product, the fp32 conversion, and a free addition of the previous sum all happen in one cycle. Honestly this is something worthy of a Mesa ticket (I don't care about amdvlk anymore since AMD got rid of it), and I doubt they've put a lot of effort into this, as graphics apps mostly use fp32. Since RGP uses amdvlk in the background, it looks like AMD hasn't looked into this use case either. Maybe there should be a shaderfloatingpointdotproduct Vulkan extension for this.
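As a rough model of what that single instruction computes (the name V_DOT2_F32_F16 is from AMD's ISA; the Python below only emulates the data flow, using `struct`'s binary16 format to round the operands to fp16):

```python
import struct

def as_f16(x):
    # round a Python float to IEEE binary16 and back,
    # emulating an fp16 operand loaded from a register half
    return struct.unpack('<e', struct.pack('<e', x))[0]

def v_dot2_f32_f16(a, b, acc):
    """Model of V_DOT2_F32_F16: two fp16 products, summed together
    and added to the previous accumulator at fp32 precision,
    all of which the real instruction does in one cycle."""
    return acc + as_f16(a[0]) * as_f16(b[0]) + as_f16(a[1]) * as_f16(b[1])
```

With a scalar ISA this takes two multiplies, two adds, and a conversion; the whole point of the discussion is that the hardware instruction collapses all of that.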
Yeah, I looked for dot first, but I couldn't trigger any dot instruction on either GPU. At least a few of the packed ones triggered, but fma would have been more interesting than a few quant-specific packed muls. I can trigger dot2 on RDNA4 in amdvlk RGP with this source code: […]
For RDNA 4 this gives me:
but for older generations at most a packed mul:
With RADV, for Vega20 I just get this:
Yeah, I think we should open an issue with Mesa. I don't think an extension would make it through, since it's basically just up to the compiler to spot the opportunity to use it.
I verified the latest commit is generating good code.
For RDNA 4 this gives me:
And here they waste a clock cycle on the "conversion" when there's an instruction that already does that. Also, v_dot2_f16_f16 is not RDNA4-exclusive, so it looks like there's some different compiler logic for that chip.
Yeah, I think we should open an issue with Mesa. I don't think an extension would make it through, since it's basically just up to the compiler to spot the opportunity to use it.
I mean, the SPIR-V literally contains a dot instruction, and the compiler should recognise that, hey, this is an fp16 dot product and we have an instruction for it! The only thing an extension would do is force the use of the instruction and also indicate that the GPU has special hardware support for dot products.
I've been thinking of getting an MI25 or MI50 once I figure out whether it's possible to cut a hole in the cover and place a fan inside. Adding those mini fans to the front is going to be loud, and my computer doesn't have the space for it. If I manage to get that going, I'll probably start looking into how Mesa handles the dot products internally.
Anyways I think this is good to merge now.
This change seems to have broken prompt adherence on stable-diffusion.cpp: leejet/stable-diffusion.cpp#847.
…ggml-org#16056)"" This reverts commit 3170fc3.
How do you obtain the SASS code? Is there any documented way of doing that? I am only aware of processing the files in […]
The trick is working for Nvidia. 😄
Sure. I am interested not only because it would make profiling easier, but also because it would be nice if we could inject arbitrary SASS code into a shader compiled with nvcc.
I've asked this as well: #15363 (comment)
Thanks, I didn't know that you can profile compute shaders with Nsight at the instruction level (e.g. bank conflicts, register usage, etc.).
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
FYI I took a quick look on the ACO side, and it turns out they don't have special support for fp16 dot products. They do have packed fmas and multiplies, though, and surprisingly they have support for BF16 dot products on RDNA 3 and 4. Of course there's no point in doing that in our case, as those chips have coopmat. There's actually a Mesa PR for this, but it got closed.
Thank you for checking. Can you open a RADV issue for it? They closed that PR because it didn't bring benefits for games, but I think for us it would be helpful.
Sure, if you provide the SPIR-V and Mesa assembly dump for your card. Please generate that with the […]. Otherwise I'll deal with this once I get my MI card 😁.
I talked briefly with the author of that Mesa PR. Basically there are precision issues with the dot2 instructions, as well as different, unexpected behaviour across the AMD generations that implement them; that's why they don't want to use them.
This PR changes the shared memory and register caching to use vec2 instead of scalars. Initially this was to enable vec2 dot instructions for the accumulations, but I think it also helps with caching, since accessing 32-bit values is more efficient than accessing 16-bit values.
It needs a few more registers because it loads 2 k-values from shared memory into registers instead of just 1.
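To illustrate the caching point (this is not the shader's code): two fp16 values laid out consecutively occupy exactly one 32-bit word, so a vec2 load moves both k-values in a single 32-bit access instead of two 16-bit ones. A small Python sketch of that layout:

```python
import struct

def pack_vec2_f16(x, y):
    # two consecutive fp16 values form one 32-bit word, which is
    # what a single vec2 shared-memory access would fetch
    return struct.unpack('<I', struct.pack('<ee', x, y))[0]

def unpack_vec2_f16(word):
    # split the 32-bit word back into its two fp16 components
    return struct.unpack('<ee', struct.pack('<I', word))
```

Nothing about the values changes; only the granularity of the memory transaction does, which is why the register cost goes up (both halves now live in registers at once).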