
Conversation


@vasqu vasqu commented Sep 25, 2025

These embedding types are barely used and only make the modeling files more complex without justifying their existence. Other position embedding types still exist in a few models; this PR only addresses the relative_key(_query) ones.

Some stats:

  • None of the slow tests use them except bert's
  • The respective models in those tests together have fewer than 2k downloads in the last month

cc @hmellor this should remove any clashes with the kwargs you encountered in vLLM :D
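
For context, a minimal sketch of how a model would have opted into these embedding types (assuming the standard BertConfig API; the example is checkpoint-independent and only illustrative):

from transformers import BertConfig, BertModel

# Until this change, position_embedding_type could be set to "relative_key"
# or "relative_key_query" in addition to the default "absolute".
config = BertConfig(position_embedding_type="relative_key_query")
model = BertModel(config)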


vasqu commented Sep 25, 2025

run-slow: flava, instructblib, mra

Contributor Author

This is mostly due to me forgetting to update them in my bert refactor PR; the diff is big because the whole refactor is included (same for the roberta example)

Contributor Author

Updated: this now only includes the changes from this PR


This comment contains run-slow, running the specified jobs:

models: ['models/flava', 'models/mra']
quantizations: [] ...

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


vasqu commented Sep 25, 2025

run-slow: instructblip


This comment contains run-slow, running the specified jobs:

models: ['models/instructblip']
quantizations: [] ...

@vasqu vasqu force-pushed the remove-relative-positions-bert-likes branch from 6046d27 to 0dbd18b on September 25, 2025 18:49
@vasqu vasqu marked this pull request as ready for review September 26, 2025 11:14

vasqu commented Sep 26, 2025

run-slow: bert, roberta, albert, mra, instructblip, blip_2, flava


This comment contains run-slow, running the specified jobs:

models: ['models/albert', 'models/bert', 'models/blip_2', 'models/flava', 'models/instructblip', 'models/mra', 'models/roberta']
quantizations: [] ...

@vasqu
Copy link
Contributor Author

vasqu commented Sep 26, 2025

Failing slow tests are the same as in main 👀


[For maintainers] Suggested jobs to run (before merge)

run-slow: albert, align, altclip, bert, bert_generation, big_bird, blip, blip_2, bridgetower, bros, camembert, canine, chinese_clip, clap, data2vec, dpr

@vasqu vasqu added the for_v5? label Sep 30, 2025

@zucchini-nlp zucchini-nlp left a comment


Thanks, super nice clean-up! 🧼

return embeddings


def eager_attention_forward(
Member

I think we can now copy bert from llama or another big model group? 👀 Keeping fewer sources of truth makes it easier to submit PRs

Contributor Author
Contributor Author

@vasqu vasqu Oct 1, 2025


Let me make a follow-up PR for that. I'd like to sync bert with bart instead though, since llama would imply causal masks (which is not the case here) and would pull in an unnecessary GQA dependency from llama

Contributor Author

Opened #41248 for the sync
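
As a rough illustration of the "fewer sources of truth" direction discussed above (the actual sync lives in #41248, so the class and import used here are only illustrative assumptions, not the merged implementation): in transformers' modular files, a model can subclass another model's blocks and only override what differs.

from transformers.models.bart.modeling_bart import BartAttention

# Illustrative sketch only: inheriting a shared attention block and overriding
# nothing keeps a single source of truth for the attention logic.
class BertAttention(BartAttention):
    pass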

Comment on lines -505 to -520
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
    seq_length = hidden_states.size()[1]
    position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
    position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
    distance = position_ids_l - position_ids_r
    positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
    positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

    if self.position_embedding_type == "relative_key":
        relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
        attention_scores = attention_scores + relative_position_scores
    elif self.position_embedding_type == "relative_key_query":
        relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
        relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
        attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

Member

Happy to see it, I had assumed that the BLIP models actually use relative positions haha. Now the qformer can support the attention implementation API 🙌🏻
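
A minimal usage sketch of what that unlocks, assuming the standard attn_implementation argument on from_pretrained (the checkpoint name is just an example):

from transformers import Blip2ForConditionalGeneration

# With the relative-position branch gone, the Q-Former attention can be routed
# through the generic attention interface, e.g. SDPA.
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",  # example checkpoint
    attn_implementation="sdpa",
)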

Contributor Author

Yeah, it's honestly a bit baffling how many models carry this code while not using it at all 👀
