
Conversation

@dmatveev (Contributor) commented Oct 2, 2025

Details:

@dmatveev requested review from a team as code owners on October 2, 2025 21:21
@github-actions bot added labels category: NPU (OpenVINO NPU plugin) and category: NPUW (NPUW plugin) on Oct 2, 2025
@dmatveev added this to the 2025.4 milestone on Oct 2, 2025
uint32_t kv_dim_src,
uint32_t kv_dim_dst) {
if (kv_dim_src != kv_dim_dst) {
// new case - do a generic copy for now (in fact it is a permute)
Contributor

I think we may leave it as is for now, but isn't the first dimension the batch, and isn't it always == 1?

In that case we don't need a 4d permute. And we have already covered and optimized most (but not all) cases for the 3d permute.

Contributor Author

We don't need a generic 4d permute for sure, but I didn't want to handle the particular cases here, as it would complicate the flow at this stage.

}
if (!prefill_attn_dyn && ov::npuw::util::optimize_value_tensors(prefill_model, true)) {
LOG_DEBUG("V-tensors transposed in prefill model");
m_kvcache_desc.v_tensors_transposed_pre = true;
@AsyaPronina (Contributor) commented Oct 3, 2025

This changes the default behaviour a bit: previously, if the transpose wasn't needed, we applied the SDPA unroll to the generate model but not to prefill (as the transpose returned false, we didn't go on and apply the same transformations to prefill).

Now we apply the unroll to both generate and prefill, regardless of whether the transpose transformation for generate returned true or false.

Is this expected? It seems more correct now, by the way.

Contributor Author

I am not sure what you refer to as the previous behavior.

SDPA unroll is part of the optimize_value_tensors routine. If the transpose wasn't needed (I'd rather call it "cancelled", as it is "needed" by default), neither transformation was called. Where did we apply the SDPA unroll to the generate model in that case? Am I missing something?

When it was actually "needed", we applied the transformation first to the kvcache model and, IF IT WAS applied, required it (via an assert) to be applied to the prefill model as well.

Or do you refer to the fact that optimize_value_tensors could unroll the SDPA in one model but return false because it couldn't transpose the v-tensor? That behavior was an inconsistency rather than something we should keep at all costs. It is still the case, by the way: we can unroll the SDPA for no actual reason (so the v-tensors won't be transposed).

A way to work around this would probably be to clone the model within the pass and then return either the original model if the transpose pass failed, or the transformed one if it was actually applied. That's probably worth doing; feel free to file a task on this for the future. Thanks!

void ov::npuw::util::permute_i4d(const ov::SoPtr<ov::ITensor>& src,
ov::SoPtr<ov::ITensor>& dst,
const std::array<int, 4> order) {
const auto& src_shape = src->get_shape();
Contributor

Do we need is_continuous() checks somewhere?

Contributor Author

This permute respects strides, so we don't.


};
const auto dst_o =
v_dst[0] * dst_s[0] + v_dst[1] * dst_s[1] + v_dst[2] * dst_s[2] + v_dst[3] * dst_s[3];
std::copy_n(src_p + src_o, elem_size, dst_p + dst_o);
Contributor Author

So based on my paper math, this approach looks correct.
This permute assumes that two dimensions swap places (order), so given a position in the source tensor (i/j/k/l), we find the right position in the dst tensor with the remapped vector (i/j/l/k in this case). As the axis order defines the physical layout, we use the src/dst strides in the same order as given. Finally, we copy elem_size bytes for a single element with copy_n.

@dmatveev dmatveev self-assigned this Oct 3, 2025
4 participants