🚨All attention refactor🚨 #35235
Conversation
Force-pushed from 0dc9253 to d1aa9ce (compare)
src/transformers/modeling_utils.py (Outdated)

    class GradientCheckpointLayer(torch.nn.Module):
This should help with kwargs as well
Force-pushed from 8b56823 to ecd814b (compare)
…ngface#36024)

* update
* update
* update
* dev-ci
* more changes
* fix
* fix
* fix

--------

Co-authored-by: ydshieh <[email protected]>
Breaking change in transformers is huggingface/transformers#35235. Need to make changes to unpin nv-a6000 workflow. Signed-off-by: gyou2021 <[email protected]>
Breaking change in transformers is huggingface/transformers#35235. Need to make changes to unpin nv-a6000 workflow. Signed-off-by: yisheng <[email protected]>
* refactor LlamaAttention
* minimal changes
* fix llama
* update
* modular gemmas
* modular nits
* modular updates
* nits
* simplify
* gpt2
* more modualr and fixes
* granite
* modular modular modular
* nits
* update
* qwen2 + starcoder2
* mostly gemma2
* Update image_processing_auto.py
* fix
* Update modular_starcoder2.py
* fix
* remove all copied from attentions
* remove gcv
* make fix-copies
* oups
* oups2.0
* fix some modulars + all copied from
* should be good now
* revert unwanted changes
* Update modeling_decision_transformer.py
* finish cleanup
* Update modeling_olmo.py
* consistency
* re-add gradient checkpointing attribute
* fix
* style
* make config necessary
* bis
* bis
* Update modeling_my_new_model2.py
* is_causal attr
* fix
* remove past kv return from decoder layer
* fix
* default rope config
* correctly fix rope config
* fix bias
* fix gpt2 attention output
* fix test
* fix inits
* fix default sdpa
* fix default sdpa implementation
* harmonize classes
* fix mistral
* fix sliding window models
* mixtral
* be more explicit
* style
* fix
* several fixes
* Update modeling_dbrx.py
* fix test
* olmo + phi
* rotary
* syle
* phi
* phi again
* again
* kwargs
* Update test_modeling_common.py
* skip fx tracing tests
* Update modeling_utils.py
* gemma 2
* again
* Update modeling_recurrent_gemma.py
* gemma2
* granite
* style
* starcoder
* Update sdpa_attention.py
* switch args
* Update modeling_mllama.py
* fix
* cache type tests
* gpt2
* Update test_modeling_common.py
* fix
* consistency
* fix shape with encoder
* should be the last one
* tests non model
* most comments
* small oupsi
* be more explicit in modulars
* more explicit modulars
* CIs! it works locally
* add kwargs to _flash_attention_forward

---------

Co-authored-by: Cyril Vallez <[email protected]>
# Adds support for `transformers` as a backend

Following huggingface/transformers#35235, a bunch of models should already be supported, and we are ramping up support for more models. Thanks @Isotr0py for the TP support, and @hmellor for his help as well!

This includes:
- `trust_remote_code=True` support: any model on the hub, if it implements attention the correct way, can be natively supported!
- tensor parallel support

Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
See huggingface/transformers#35235 (comment) for context. A refactor in transformers moved the rotary embedding of Mistral (and probably other models) to the model level. This made the device map used in one of the tests incorrect. This PR fixes the device map. Note that this fix doesn't really have anything to do with prefix tuning; the error occurred even before prefix tuning was used.
The changes in huggingface/transformers#35235 caused a couple of adaption prompt tests to fail. This PR fixes these failures while maintaining compatibility with older transformers versions. Required changes:
- hidden_size attribute removed from the model, now config.hidden_size
- num_heads attribute removed from the model, now config.num_attention_heads
- forward now returns 2 outputs instead of 3; rewritten to be agnostic towards the number of outputs
        **kwargs,
    ) -> Tuple[torch.Tensor, None]:
        if hasattr(module, "num_key_value_groups"):
            key = repeat_kv(key, module.num_key_value_groups)
@ArthurZucker Do you know whether this repeat_kv is needed here? torch.nn.functional.scaled_dot_product_attention allows the number of heads to differ between q and k/v, so this expansion just defeats the purpose of doing GQA.
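For context, a minimal sketch of what `repeat_kv` does, based on the helper commonly found in transformers modeling code (not the exact source; shapes and names are illustrative):

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand key/value heads so they match the number of query heads.

    (batch, num_kv_heads, seq_len, head_dim)
        -> (batch, num_kv_heads * n_rep, seq_len, head_dim)
    """
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Insert a repeat dimension, broadcast it, then fold it into the head dim.
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)


# With 32 query heads and 8 key/value heads, the KV tensors are expanded 4x
# before SDPA; this is the expansion the question above refers to.
key = torch.randn(1, 8, 16, 64)
print(repeat_kv(key, 4).shape)  # torch.Size([1, 32, 16, 64])
```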
@yuanyao-nv Sadly, yes, it is there for multiple reasons:

- GQA support was only added to sdpa later (we still support torch versions as old as 2.1)
- The GQA support in sdpa is very limited
  - Quoting https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html: "It currently works only for Flash_attention and math kernel on CUDA tensor"
- However, since we more often than not use a mask, we require the memory-efficient backend (fa cannot work with masks in sdpa) to avoid the least efficient math backend (see the sketch below this list)
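A rough illustration of that mask constraint. This is a sketch, not the transformers code path; it assumes a CUDA device, a torch version with `torch.nn.attention`, and fp16 tensors, and the exact error text may differ across versions:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

assert torch.cuda.is_available(), "this sketch assumes a CUDA device"

q = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 128, 64, device="cuda", dtype=torch.float16)
# An arbitrary additive float mask (e.g. padding), not just a causal flag.
mask = torch.zeros(1, 1, 128, 128, device="cuda", dtype=torch.float16)

# The flash kernel inside SDPA does not accept an explicit attn_mask, so
# restricting SDPA to it while passing a mask leaves no kernel to run.
try:
    with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
        F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
except RuntimeError as err:
    print("flash-only + mask:", err)

# The memory-efficient kernel does accept the mask, which is why the sdpa
# path wants it available rather than falling back to the math kernel.
with sdpa_kernel(backends=[SDPBackend.EFFICIENT_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)
```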
On point!
Thanks for the reply!
With regard to the flash attention support gap, how does introducing repeat_kv solve the issue? Does flash attention have better support for masked cases in non-GQA than in GQA?
The reason I'm asking is that I'm exporting models to an FX graph and to ONNX, and I'm trying to preserve the SDPA op for attention in the FX graph. repeat_kv introduces additional ops that can lead to inefficiencies when a non-torch backend parses such graphs (see the export sketch below).
I'm wondering if the use of repeat_kv should be configurable based on which backend is intended?
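To make the export concern concrete, a small sketch. The module here is hypothetical, standing in for the sdpa attention path, and it assumes a torch version that ships `torch.export`:

```python
import torch
import torch.nn.functional as F

class GqaSdpa(torch.nn.Module):
    """Hypothetical stand-in for an sdpa attention call preceded by repeat_kv."""

    def forward(self, q, k, v):
        n_rep = q.shape[1] // k.shape[1]
        # Stand-in for repeat_kv: materializes the extra key/value heads.
        k = k.repeat_interleave(n_rep, dim=1)
        v = v.repeat_interleave(n_rep, dim=1)
        return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 32, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# The exported graph contains repeat_interleave/expand nodes next to the
# scaled_dot_product_attention node; these are the extra ops a non-torch
# backend has to pattern-match away.
exported = torch.export.export(GqaSdpa(), (q, k, v))
print(exported)
```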
@vasqu @ArthurZucker any thoughts on the above questions? Thanks.
Another follow-up question (might be a naive one):
The use case you're referring to above seems to be attn_implementation="sdpa" combined with explicitly requesting the FA backend (maybe via a context manager: with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):)?
However, if the user really intends to use FA, shouldn't they directly set attn_implementation="flash_attention_2" or "flash_attention_3"? I.e., why would they set attn_implementation="sdpa" and then request the FA kernel?
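For concreteness, the two options being contrasted look roughly like this. The checkpoint name is a placeholder, and the sketch assumes a CUDA GPU, the flash-attn package for the first variant, and a torch version with `torch.nn.attention`:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-causal-lm"  # placeholder checkpoint

# Option 1: request the dedicated flash-attention code path in transformers.
model_fa = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Option 2: keep the sdpa path and restrict which kernel SDPA may dispatch to
# via torch's context manager.
model_sdpa = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    outputs = model_sdpa(**inputs)
```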
Yes, in general using flash_attention_2 is just the best / recommended way!
Thanks for confirming. In that case, is there anything stopping us from removing repeat_kv from the sdpa_attention path? It sounds like if the user really wants to use the FA backend, they should not need to use this path in the first place.
This would also make the exported FX and ONNX graphs more efficient and easier for other backends to work with.
I'd be happy to submit a PR for this.
I think #35235 (comment) still explains most reasons:

- `enable_gqa` is not available for earlier versions of torch (it starts with 2.5.x)
- If we have a mask, we need to avoid the math kernel (glad to see benchmarks to prove me wrong there) and instead use the xformers one (memory-efficient backend)
- `enable_gqa` works only with fa or math:
  - If we have a mask, it already cuts out the fa kernel (torch internal restriction)
  - Fallback to math kernel
- It's not about users requesting specific backends but more so about avoiding inefficient branches (math kernel); if a user uses SDPA they should expect the more efficient backends...

The original flash attention is better in that way, but SDPA is the standard attention as it's native torch.
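A small sketch of the `enable_gqa` path versus the manual expansion that `repeat_kv` performs. It requires torch >= 2.5, and per the docs quoted earlier the flag is only guaranteed for the flash and math kernels on CUDA tensors, so the sketch assumes a CUDA device; shapes are made up:

```python
import torch
import torch.nn.functional as F

assert torch.cuda.is_available(), "this sketch assumes a CUDA device"
device = "cuda"

batch, q_heads, kv_heads, seq_len, head_dim = 1, 32, 8, 16, 64
query = torch.randn(batch, q_heads, seq_len, head_dim, device=device)
key = torch.randn(batch, kv_heads, seq_len, head_dim, device=device)
value = torch.randn(batch, kv_heads, seq_len, head_dim, device=device)

# torch >= 2.5: SDPA broadcasts the 8 KV heads over the 32 query heads itself,
# so no explicit repeat_kv is needed.
out_gqa = F.scaled_dot_product_attention(query, key, value, enable_gqa=True)

# Pre-2.5 equivalent (what repeat_kv effectively does): materialize the heads.
key_rep = key.repeat_interleave(q_heads // kv_heads, dim=1)
value_rep = value.repeat_interleave(q_heads // kv_heads, dim=1)
out_manual = F.scaled_dot_product_attention(query, key_rep, value_rep)

torch.testing.assert_close(out_gqa, out_manual, atol=1e-4, rtol=1e-4)
```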
FYI, there is #39412 which partially enables this kwarg
What does this PR do?
Todo in this PR: