NPUW: Introduce attention hints, allow different kvcache layouts #32284
Conversation
A proper copy should handle it, but there's an assert for now
…ing and serialization
uint32_t kv_dim_src,
uint32_t kv_dim_dst) {
    if (kv_dim_src != kv_dim_dst) {
        // new case - do a generic copy for now (in fact it is a permute)
I think we may leave it as is for now, but isn't the first dimension a batch, and isn't it always == 1?
In that case we don't need a 4D permute. And we have already covered and optimized most (but not all) cases for the 3D permute.
We don't need a generic 4D permute for sure, but I didn't want to handle the particular cases here as it'd complicate the flow at this stage.
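For context, a minimal sketch (not PR code) of the reduction discussed here: assuming batch == 1 and an `order` that keeps axis 0 in place, the 4D permute degenerates to a 3D one over the remaining axes, so the already-optimized 3D paths could be reused. The helper name is hypothetical.

```cpp
#include <array>

// Hypothetical helper, illustration only: with batch == 1 and order4[0] == 0,
// a 4D permute over (B, H, S, E) is equivalent to a 3D permute over (H, S, E).
std::array<int, 3> to_3d_order(const std::array<int, 4>& order4) {
    // Drop the batch axis and shift the rest down: axis 1 -> 0, 2 -> 1, 3 -> 2
    return {order4[1] - 1, order4[2] - 1, order4[3] - 1};
}
```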
}
if (!prefill_attn_dyn && ov::npuw::util::optimize_value_tensors(prefill_model, true)) {
    LOG_DEBUG("V-tensors transposed in prefill model");
    m_kvcache_desc.v_tensors_transposed_pre = true;
This changes the default behaviour a bit: previously, if the transpose wasn't needed, we applied the SDPA unroll to the generate model but not to prefill (as the transpose returned `false`, we didn't go on and apply the same transformations to prefill).
Now we apply the unroll to both generate and prefill regardless of whether the transpose transformation for generate returned `true` or `false`.
Is it expected? It seems more correct now, by the way
I am not sure what you refer to as the previous behavior. SDPA unroll is a part of the `optimize_value_tensors` routine. If the transpose wasn't needed (I'd rather call it "cancelled" as it is "needed" by default), neither transformation was called. Where did we apply the SDPA unroll for the generate model in this case, am I missing something?
When it was actually "needed", we applied the transformation first to the kvcache model and IF IT WAS applied, required it (via assert) to be applied to the prefill model as well.
Or do you refer to the fact that `optimize_value_tensors` could unroll SDPA in one model but return `false` as it couldn't transpose the v-tensor? That behavior was rather an inconsistency than something we should keep at all costs. It is still the case, btw: we can unroll the SDPA for no actual reason (so v-tensors won't be transposed).
Probably a way to work around this would be to clone the model within the pass and then return either the original model if the transpose pass has failed, or the transformed one if it was actually applied. That's probably a thing - feel free to file a task on this for the future. Thanks!
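A hedged sketch of that workaround, assuming only the `optimize_value_tensors(model, bool)` signature quoted in this PR; the wrapper name is hypothetical:

```cpp
#include <memory>
#include "openvino/core/model.hpp"
// plus the internal npuw util header that declares optimize_value_tensors()

// Hypothetical wrapper, illustration only: run the in-place pass on a clone
// and adopt the clone only if the transpose actually applied, so a pointless
// SDPA unroll never leaks into the model we keep.
std::shared_ptr<ov::Model> try_optimize_value_tensors(
        const std::shared_ptr<ov::Model>& model, bool is_prefill) {
    auto candidate = model->clone();  // the original stays untouched
    if (ov::npuw::util::optimize_value_tensors(candidate, is_prefill)) {
        return candidate;  // transpose (and the SDPA unroll) took effect
    }
    return model;  // transpose failed - discard the clone, incl. its unroll
}
```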
void ov::npuw::util::permute_i4d(const ov::SoPtr<ov::ITensor>& src,
                                 ov::SoPtr<ov::ITensor>& dst,
                                 const std::array<int, 4> order) {
    const auto& src_shape = src->get_shape();
Do we need `is_continuous()` checks somewhere?
This permute respects strides, so we don't.
};
const auto dst_o =
    v_dst[0] * dst_s[0] + v_dst[1] * dst_s[1] + v_dst[2] * dst_s[2] + v_dst[3] * dst_s[3];
std::copy_n(src_p + src_o, elem_size, dst_p + dst_o);
So based on my paper math this approach looks correct.
This permute assumes that two dimensions swap their places (`order`), so given a position in the source tensor (`i/j/k/l`), we find the right position in the dst tensor with the remapped vector (`i/j/l/k` in this case). As the axis order defines the physical layout, we use the src/dst strides in the same order as given. Finally, we copy `elem_size` bytes for a single element with `copy_n`.
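To make the addressing concrete, here is a self-contained sketch of that access pattern with raw pointers and byte strides (not the actual `permute_i4d` body; names are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Illustration only: dst[v_dst] = src[v_src], where dst axis d takes its
// coordinate from src axis order[d]. Offsets are computed from per-tensor
// byte strides, so non-contiguous layouts are handled without an
// is_continuous() precondition.
void permute_4d_sketch(const uint8_t* src_p, uint8_t* dst_p,
                       const std::array<std::size_t, 4>& src_shape,
                       const std::array<std::size_t, 4>& src_s,  // src byte strides
                       const std::array<std::size_t, 4>& dst_s,  // dst byte strides
                       const std::array<int, 4>& order,          // e.g. {0, 1, 3, 2}
                       std::size_t elem_size) {
    std::array<std::size_t, 4> v_src{};
    for (v_src[0] = 0; v_src[0] < src_shape[0]; v_src[0]++)
    for (v_src[1] = 0; v_src[1] < src_shape[1]; v_src[1]++)
    for (v_src[2] = 0; v_src[2] < src_shape[2]; v_src[2]++)
    for (v_src[3] = 0; v_src[3] < src_shape[3]; v_src[3]++) {
        std::array<std::size_t, 4> v_dst{};
        for (int d = 0; d < 4; d++) {
            v_dst[d] = v_src[order[d]];  // remap (i/j/k/l) -> e.g. (i/j/l/k)
        }
        const auto src_o = v_src[0] * src_s[0] + v_src[1] * src_s[1] +
                           v_src[2] * src_s[2] + v_src[3] * src_s[3];
        const auto dst_o = v_dst[0] * dst_s[0] + v_dst[1] * dst_s[1] +
                           v_dst[2] * dst_s[2] + v_dst[3] * dst_s[3];
        std::copy_n(src_p + src_o, elem_size, dst_p + dst_o);  // one element
    }
}
```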
Details:
- Introduces `ATTENTION_HINT`s for prefill & generate stages
- What the `DYNAMIC` hint does is cancel the SDPA unroll & v-tensor transpose
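For illustration, a hedged sketch of how such a hint could gate both transformations, echoing the `prefill_attn_dyn` check quoted above; apart from `optimize_value_tensors`, `prefill_attn_dyn` and `v_tensors_transposed_pre`, every name below is an assumption, not the PR's actual API:

```cpp
// Hypothetical sketch: a DYNAMIC attention hint for a stage skips
// optimize_value_tensors() for that stage's model, so neither the SDPA
// unroll nor the v-tensor transpose is applied there.
if (!kvcache_attn_dyn && ov::npuw::util::optimize_value_tensors(kvcache_model, false)) {
    m_kvcache_desc.v_tensors_transposed = true;  // assumed generate-side flag
}
if (!prefill_attn_dyn && ov::npuw::util::optimize_value_tensors(prefill_model, true)) {
    m_kvcache_desc.v_tensors_transposed_pre = true;
}
```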