NPUW: Introduce attention hints, allow different kvcache layouts #32284
Conversation
A proper copy should handle it, but there's an assert for now
…ing and serialization
uint32_t kv_dim_src,
uint32_t kv_dim_dst) {
    if (kv_dim_src != kv_dim_dst) {
        // new case - do a generic copy for now (in fact it is a permute)
I think we may leave it as is for now, but isn't the first dimension a batch, and isn't it always == 1?
In that case we don't need a 4D permute. And we have already covered and optimized most (but not all) cases for the 3D permute.
We don't need a generic 4D permute for sure, but I didn't want to handle the particular cases here as it'd complicate the flow at this stage.
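For context, a minimal sketch (not PR code) of the reduction discussed here: assuming batch == 1 and an `order` that keeps axis 0 in place, the 4D permute degenerates to a 3D one over the remaining axes, so the already-optimized 3D paths could be reused. The helper name is hypothetical.

```cpp
#include <array>

// Hypothetical helper, illustration only: with batch == 1 and order4[0] == 0,
// a 4D permute over (B, H, S, E) is equivalent to a 3D permute over (H, S, E).
std::array<int, 3> to_3d_order(const std::array<int, 4>& order4) {
    // Drop the batch axis and shift the rest down: axis 1 -> 0, 2 -> 1, 3 -> 2
    return {order4[1] - 1, order4[2] - 1, order4[3] - 1};
}
```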
}
if (!prefill_attn_dyn && ov::npuw::util::optimize_value_tensors(prefill_model, true)) {
    LOG_DEBUG("V-tensors transposed in prefill model");
    m_kvcache_desc.v_tensors_transposed_pre = true;
This changes the default behaviour a bit: previously, if the transpose wasn't needed, we applied the SDPA unroll to the generate model but not to prefill (as the transpose returned `false`, we didn't go on and apply the same transformations to prefill).
Now we apply the unroll to both generate and prefill regardless of whether the transpose transformation for generate returned `true` or `false`.
Is it expected? It seems more correct now, by the way
I am not sure what you refer to as the previous behavior. SDPA unroll is a part of the `optimize_value_tensors` routine. If the transpose wasn't needed (I'd rather call it "cancelled" as it is "needed" by default), neither transformation was called. Where did we apply the SDPA unroll for the generate model in this case, am I missing something?
When it was actually "needed", we applied the transformation first to the kvcache model and IF IT WAS applied, required it (via assert) to be applied to the prefill model as well.
Or do you refer to the fact that `optimize_value_tensors` could unroll SDPA in one model but return `false` as it couldn't transpose the v-tensor? That behavior was rather an inconsistency than something we should keep at all costs. It is still the case, btw: we can unroll the SDPA for no actual reason (so v-tensors won't be transposed).
Probably a way to work around this would be to clone the model within the pass and then return either the original model if the transpose pass has failed, or the transformed one if it was actually applied. That's probably a thing - feel free to file a task on this for the future. Thanks!
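A hedged sketch of that workaround, assuming only the `optimize_value_tensors(model, bool)` signature quoted in this PR; the wrapper name is hypothetical:

```cpp
#include <memory>
#include "openvino/core/model.hpp"
// plus the internal npuw util header that declares optimize_value_tensors()

// Hypothetical wrapper, illustration only: run the in-place pass on a clone
// and adopt the clone only if the transpose actually applied, so a pointless
// SDPA unroll never leaks into the model we keep.
std::shared_ptr<ov::Model> try_optimize_value_tensors(
        const std::shared_ptr<ov::Model>& model, bool is_prefill) {
    auto candidate = model->clone();  // the original stays untouched
    if (ov::npuw::util::optimize_value_tensors(candidate, is_prefill)) {
        return candidate;  // transpose (and the SDPA unroll) took effect
    }
    return model;  // transpose failed - discard the clone, incl. its unroll
}
```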
void ov::npuw::util::permute_i4d(const ov::SoPtr<ov::ITensor>& src,
                                 ov::SoPtr<ov::ITensor>& dst,
                                 const std::array<int, 4> order) {
    const auto& src_shape = src->get_shape();
Do we need `is_continuous()` checks somewhere?
This permute respects strides, so we don't.
};
const auto dst_o =
    v_dst[0] * dst_s[0] + v_dst[1] * dst_s[1] + v_dst[2] * dst_s[2] + v_dst[3] * dst_s[3];
std::copy_n(src_p + src_o, elem_size, dst_p + dst_o);
So based on my paper math this approach looks correct.
This permute assumes that two dimensions swap their places (`order`), so given a position in the source tensor (`i/j/k/l`), we find the right position in the dst tensor with the remapped vector (`i/j/l/k` in this case). As the axis order defines the physical layout, we use the src/dst strides in the same order as given. Finally, we copy `elem_size` bytes for a single element with `copy_n`.
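To make the addressing concrete, here is a self-contained sketch of that access pattern with raw pointers and byte strides (not the actual `permute_i4d` body; names are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Illustration only: dst[v_dst] = src[v_src], where dst axis d takes its
// coordinate from src axis order[d]. Offsets are computed from per-tensor
// byte strides, so non-contiguous layouts are handled without an
// is_continuous() precondition.
void permute_4d_sketch(const uint8_t* src_p, uint8_t* dst_p,
                       const std::array<std::size_t, 4>& src_shape,
                       const std::array<std::size_t, 4>& src_s,  // src byte strides
                       const std::array<std::size_t, 4>& dst_s,  // dst byte strides
                       const std::array<int, 4>& order,          // e.g. {0, 1, 3, 2}
                       std::size_t elem_size) {
    std::array<std::size_t, 4> v_src{};
    for (v_src[0] = 0; v_src[0] < src_shape[0]; v_src[0]++)
    for (v_src[1] = 0; v_src[1] < src_shape[1]; v_src[1]++)
    for (v_src[2] = 0; v_src[2] < src_shape[2]; v_src[2]++)
    for (v_src[3] = 0; v_src[3] < src_shape[3]; v_src[3]++) {
        std::array<std::size_t, 4> v_dst{};
        for (int d = 0; d < 4; d++) {
            v_dst[d] = v_src[order[d]];  // remap (i/j/k/l) -> e.g. (i/j/l/k)
        }
        const auto src_o = v_src[0] * src_s[0] + v_src[1] * src_s[1] +
                           v_src[2] * src_s[2] + v_src[3] * src_s[3];
        const auto dst_o = v_dst[0] * dst_s[0] + v_dst[1] * dst_s[1] +
                           v_dst[2] * dst_s[2] + v_dst[3] * dst_s[3];
        std::copy_n(src_p + src_o, elem_size, dst_p + dst_o);  // one element
    }
}
```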
Details:
- Introduces `ATTENTION_HINT`s for prefill & generate stages
- What the `DYNAMIC` hint does is cancel the SDPA unroll & v-tensor transpose
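For illustration, a hedged sketch of how such a hint could gate both transformations, echoing the `prefill_attn_dyn` check quoted above; apart from `optimize_value_tensors`, `prefill_attn_dyn` and `v_tensors_transposed_pre`, every name below is an assumption, not the PR's actual API:

```cpp
// Hypothetical sketch: a DYNAMIC attention hint for a stage skips
// optimize_value_tensors() for that stage's model, so neither the SDPA
// unroll nor the v-tensor transpose is applied there.
if (!kvcache_attn_dyn && ov::npuw::util::optimize_value_tensors(kvcache_model, false)) {
    m_kvcache_desc.v_tensors_transposed = true;  // assumed generate-side flag
}
if (!prefill_attn_dyn && ov::npuw::util::optimize_value_tensors(prefill_model, true)) {
    m_kvcache_desc.v_tensors_transposed_pre = true;
}
```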