Reworked sgemm_kleidi memory allocations to reuse memory buffers #26166
Conversation
…storage Signed-off-by: Jonathan Clohessy <[email protected]>
This PR introduces a good performance optimization by reusing memory buffers through thread_local storage. The approach seems sound, especially for reducing allocation overhead on platforms like Android, as highlighted in the description. The use of std::vector::reserve and std::vector::resize for managing the thread-local buffers is appropriate for this kind of optimization.
One minor question regarding the lhs_packed and rhs_packed buffers:
if (g_kai_tls.lhs_packed.capacity() < LhsPackedStride * BatchSize) {
g_kai_tls.lhs_packed.reserve(LhsPackedStride * BatchSize);
}
g_kai_tls.lhs_packed.resize(LhsPackedStride * BatchSize);
This logic ensures the capacity grows as needed. Is there any scenario where LhsPackedStride * BatchSize could become significantly smaller than the current capacity, leaving a large amount of unused reserved memory? While std::vector doesn't shrink its capacity automatically, that is generally fine for performance-critical buffers that tend to grow or stabilize at a certain size. Just confirming this is the intended behavior and not a concern for this specific use case.
LhsPackedData = LhsPacked.get();

std::unique_ptr<std::byte[]> RhsPacked{nullptr};
if (g_kai_tls.lhs_packed.capacity() < LhsPackedStride * BatchSize) {
Is there any measure to avoid LhsPackedStride * BatchSize having integer overflow?
That is a fair point. I can add some checks before resizing to try to protect against overflow.
I've added a new function to mlasi_kleidi.h that uses a gcc/clang builtin to check for size_t wraparound, and I also added a generic implementation for systems without the builtin.
Yes, this is the intended behavior. You are correct that if the required buffer size later becomes smaller, the capacity may remain larger than strictly necessary, leading to some unused reserved memory. In practice, we expect the footprint to plateau based on the model/matrices under test, and the maximum allocation would be the same with or without this approach. We deliberately avoid shrinking the vectors, since at this level we cannot predict future operations, and frequent shrink/grow cycles could be costly. The trade-off is slightly higher sustained memory usage, but with more predictable performance.
std::vector<std::byte> rhs_packed;
std::vector<std::byte> lhs_packed;
};
static thread_local KaiTlsBuffers g_kai_tls;
Please note that ORT, as a library, cannot control when a thread_local variable is deallocated. Since the destructor of this variable does not depend on anything except the C++ runtime, I feel it's fine.
…t throughout Sgemm Signed-off-by: Jonathan Clohessy <[email protected]>
if (mul_overflow_size_t_builtin(TileSizeM, TileSizeN, &tile_elems))
{
    // size_t wraparound detected, exit
    return
missing semi-colon?
also, this will just return early from the worker thread and not actually exit the function. perhaps this check should be moved before the MlasTrySimpleParallel call, using the maximum possible values of TileSizeM and TileSizeN.
Apologies, that semi-colon was a typo on my part.
I have moved that logic outside the MlasTrySimpleParallel call so we can return based on the maximum size.
Signed-off-by: Jonathan Clohessy <[email protected]>
Key changes
This PR changes the KleidiAI integration within the existing sgemm_kleidiai.cpp implementation.
During internal testing it was noted that memory allocation overhead, caused by repeated allocations of vectors, was having a negative impact on performance figures.
The changes introduce thread-local buffers so memory can be reused during inference.
Android platforms are particularly sensitive to this; we have observed inference times being significantly impacted by memory allocation overhead.
Example performance
All runs were captured using onnxruntime_perf_test

e.g. onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1
Android Platform
In addition, on M4 we have also observed slight improvements on models; however, the gain is not as significant, since allocation overhead accounts for a smaller share of total time on that platform.
Mac Mini M4
