Conversation

@0xDELUXA (Contributor) commented Sep 19, 2025

Crashes occasionally occur at high resolutions, with or without PyTorch attention in the VAE. For normal image sizes, though, PyTorch attention is faster.

Tested on Windows (native PyTorch, not WSL) with a gfx1200 card: with this change it uses PyTorch attention in the VAE. RDNA 3 and earlier cards continue to use split attention as before.

Partially reverts “Disable PyTorch attention in VAE for AMD.” (commit 1cd6cd6) for RDNA 4.
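
For reviewers skimming this, here is a minimal sketch of the gating this PR implies. It is not the actual model_management.py code; `is_amd` and `gfx_arch` are hypothetical stand-ins for values ComfyUI already derives from the device:

```python
# Sketch only, not the real ComfyUI code: allow PyTorch attention in the VAE
# for RDNA 4 (gfx12xx), while RDNA 3 and earlier keep split attention.
def vae_uses_pytorch_attention(is_amd: bool, gfx_arch: str) -> bool:
    if not is_amd:
        return True                      # non-AMD behaviour is unchanged by this PR
    return gfx_arch.startswith("gfx12")  # e.g. "gfx1200"; older archs stay on split attention
```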

@0xDELUXA changed the title from “Enable pytorch attention in VAE for AMD RDNA 4” to “Enable PyTorch attention in VAE for AMD RDNA 4” on Sep 19, 2025
@qawery-just-sad

Potential duplicate of #8289

@0xDELUXA (Contributor, Author)

> Potential duplicate of #8289

Could be, but this only affects RDNA 4. Based on my testing, RDNA 4 benefits from PyTorch attention here. I can't speak for the other archs, though.

@A-Temur commented Sep 22, 2025

@0xDELUXA
Which PyTorch/ROCm version are you currently using?
My current setup:
Ubuntu 24.04.3
ROCm 7.0.0
torch 2.8.0 (+vision+triton...)
GPU: 7900 GRE (RDNA 3)

ComfyUI automatically uses PyTorch attention, and so far I've had no issues.

But perhaps I haven't come across your specific workflow yet. If you don't mind sharing it, I could test it on my device and give you some feedback.

@0xDELUXA (Contributor, Author) commented Sep 22, 2025

@A-Temur
Platform: Windows
Python version: 3.11.9
PyTorch version: 2.10.0a0+rocm7.0.0rc20250918
AMD arch: gfx1200
ROCm version: (7, 1)

Without this PR I got:
Using pytorch attention (because I’m using the --use-pytorch-cross-attention flag)
but then, automatically:
Using split attention in VAE

After merging this PR:
Using pytorch attention in VAE

As you can see, I'm using a TheRock wheel for Windows. It's in a nightly state and not yet available on pytorch.org.
I think we can't really compare Linux and Windows performance in this case.

Also, does your console say: Using pytorch attention in VAE?

@A-Temur commented Sep 22, 2025

@0xDELUXA
Now I see it:
My console also prints out "Using pytorch attention in VAE", but only when I start Comfy. After that I only get "Using split attention in VAE".

Prior to that I was using ROCm 6.4.3 with PyTorch 2.5.1 in a Docker setup (on Fedora). I didn't get the message about PyTorch attention, and the performance was very poor (very long loading times, especially before the KSampler starts and before VAE decode).

Since switching to ROCm 7 with PyTorch 2.8 on Ubuntu (no Docker), the performance increase has been huge and I've had no issues so far. I would highly recommend the officially recommended Ubuntu/RHEL installation; in my experience, ROCm + Radeon didn't do well on Windows + WSL or other unsupported Linux distros.

On what specific models/workflows do you get the mentioned crash (without this PR)?
I'm curious to check if it's the same for me.

@0xDELUXA (Contributor, Author) commented Sep 22, 2025

I'm not talking about Windows + WSL; it's native PyTorch on Windows.

@A-Temur commented Sep 22, 2025

@0xDELUXA
OK, but I'm now asking for the third time: please share your workflow/model so others (including me) can check whether the crash only happens on your specific setup.

@0xDELUXA (Contributor, Author)

> On what specific models/workflows do you get the mentioned crash (without this PR)?
> I'm curious to check if it's the same for me.

This specific PR doesn't do anything about the VAE crashes; it just makes AMD use the better attention optimization in the VAE. I don't think you'll get the crashes on Linux at all. This is a Windows thing, AFAIK.

@0xDELUXA (Contributor, Author) commented Sep 22, 2025

> @0xDELUXA
> OK, but I'm now asking for the third time: please share your workflow/model so others (including me) can check whether the crash only happens on your specific setup.

https://github.com/comfyanonymous/ComfyUI/blob/master/comfy%2Fmodel_management.py#L1116

I can share my workflow later, when I'm back online.

@A-Temur commented Sep 22, 2025

I've now tested the performance with PyTorch attention forcefully enabled and without it (vanilla Comfy) on a simple SDXL image-generation workflow:

PyTorch attention disabled (default):
first generation: 18.22 seconds
subsequent generations: 14 seconds on average

PyTorch attention enabled:
first generation: 25.37 seconds
subsequent generations: 15.4 seconds on average
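
For anyone repeating this comparison, here is a rough timing sketch. It is not how the numbers above were produced; `run_once` is a hypothetical stand-in for whatever triggers one full generation:

```python
import time
import torch

def average_seconds(run_once, warmup=1, repeats=5):
    """Average wall-clock time of one generation; run_once() is a placeholder callable."""
    for _ in range(warmup):
        run_once()                    # let MIOpen / attention kernels warm up first
    torch.cuda.synchronize()          # ROCm GPUs are driven through the torch.cuda API
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        torch.cuda.synchronize()      # wait for queued GPU work before stopping the clock
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```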

@RandomGitUser321 (Contributor) commented Sep 22, 2025

Might be partially related, but are you also making sure to run set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 before running main.py? You have to run that command with the newer TheRock Windows wheels (I think it's mentioned somewhere in the million PRs and issues); otherwise I don't think --use-pytorch-cross-attention works correctly (it won't use AOTriton if that isn't set beforehand). Or at least that's what I have to do with this gfx110X dGPU on Windows 11 24H2.

Also, I think MIOpen might be playing a part in the VAE encode/decode issues. With set MIOPEN_FIND_MODE=FAST it works quickly, but probably without all the memory savings. With the default, the first time you give it a new combination of resolutions it takes ages (whole minutes), but subsequent runs are quick.
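
For reference, a minimal sketch of setting both variables from a small Python launcher instead of the console. It assumes you start ComfyUI from your own wrapper script; the plain set VAR=value commands above do the same thing:

```python
import os

# Must be set before torch/MIOpen are loaded, mirroring `set VAR=value` in the console.
os.environ["TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"] = "1"  # AOTriton SDPA on TheRock Windows wheels
os.environ["MIOPEN_FIND_MODE"] = "FAST"                      # skip MIOpen's slow first-run kernel search

import torch  # noqa: E402  # imported only after the environment is prepared
```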

@0xDELUXA (Contributor, Author) commented Sep 22, 2025

> I've now tested the performance with PyTorch attention forcefully enabled and without it (vanilla Comfy) on a simple SDXL image-generation workflow:
>
> PyTorch attention disabled (default):
> first generation: 18.22 seconds
> subsequent generations: 14 seconds on average
>
> PyTorch attention enabled:
> first generation: 25.37 seconds
> subsequent generations: 15.4 seconds on average

How did you enable it forcefully? By merging this PR locally, or some other way? I don't think we have a flag specifically for enabling PyTorch attention in the VAE.
PyTorch attention as a whole is one thing; enabling it in the VAE is another.

You said:

> My console also prints out "Using pytorch attention in VAE", but only when I start Comfy. After that I only get "Using split attention in VAE".

When ComfyUI starts, it doesn't print anything about the VAE; it prints:
Using pytorch attention
Then, when we start generating an image, it prints:
Using split attention in VAE (without this PR), or
Using pytorch attention in VAE (with this PR).

@0xDELUXA (Contributor, Author) commented Sep 22, 2025

@RandomGitUser321

Yes, without the TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 env var it shows a warning and can't use flash or memory-efficient attention at all.

If this is a MIOpen thing, then I'm thinking of filing an issue there.

My finding is that it isn’t the attention mechanism in the VAE that causes crashes.
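
For what it's worth, here is a minimal sketch (not ComfyUI code) for checking whether flash / memory-efficient SDPA actually runs on the card, independent of what ComfyUI picks for the VAE:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

device = "cuda"  # ROCm builds expose the GPU through the torch.cuda API
q = torch.randn(1, 8, 1024, 64, device=device, dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the flash / memory-efficient backends; if neither is usable
# (e.g. AOTriton not enabled), this errors out instead of silently using the math fallback.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v)
print("SDPA ran, output shape:", tuple(out.shape))
```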

@RandomGitUser321 (Contributor) commented Sep 22, 2025

> If this is a MIOpen thing, then I'm thinking of filing an issue there.

I'm pretty sure it is, and I think there may already be some issues filed for it. For instance, here's one: ROCm/rocm-libraries#1571
It's not 100% directly related to this issue, but it at least shows there are problems between MIOpen and VAE encoding/decoding.

@0xDELUXA (Contributor, Author) commented Sep 22, 2025

> I'm pretty sure it is, and I think there may already be some issues filed for it. For instance, here's one: ROCm/rocm-libraries#1571
> It's not 100% directly related to this issue, but it at least shows there are problems between MIOpen and VAE encoding/decoding.

Really good, so the devs know there's something wrong. I'll try some runs with verbose MIOpen logging to get some logs, but I don't think I can debug them locally; this seems more like an internal thing.

@RandomGitUser321 (Contributor)

You should be able to do some pretty spammy logging with it:
https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/debug-log.html
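
A hedged sketch of what that usually looks like when launching from Python; the variable names are taken from the linked debugging docs, and the exact levels are worth double-checking there:

```python
import os

# MIOpen logging switches from the linked debugging docs; MIOPEN_LOG_LEVEL 7 is the most verbose.
os.environ["MIOPEN_ENABLE_LOGGING"] = "1"
os.environ["MIOPEN_ENABLE_LOGGING_CMD"] = "1"  # also prints reproducible MIOpenDriver command lines
os.environ["MIOPEN_LOG_LEVEL"] = "7"

import torch  # noqa: E402  # MIOpen reads these once the ROCm runtime loads
```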
