
Conversation

JonathanC-ARM
Contributor

Key changes

This PR makes changes to KleidiAI integration within the existing sgemm_kleidiai.cpp implementation.

During internal testing it was noted that memory allocation overhead, caused by repeatedly allocating vectors, was having a negative impact on performance.

The changes introduce thread local buffers for reusing memory during inference.

Android platforms are particularly sensitive to this; we have observed inference times being significantly impacted by memory allocation overhead.
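
For context, the reuse pattern looks roughly like the sketch below, assembled from the snippets quoted later in this thread. KaiTlsBuffers and g_kai_tls appear in the diff; the GetLhsPackedScratch helper is illustrative only.

#include <cstddef>
#include <vector>

// Per-thread scratch buffers, reused across inference calls so that
// steady-state GEMMs perform no heap allocations.
struct KaiTlsBuffers {
    std::vector<std::byte> rhs_packed;
    std::vector<std::byte> lhs_packed;
};
static thread_local KaiTlsBuffers g_kai_tls;

// Illustrative helper: grow-only resize of the LHS packing buffer.
// Capacity never shrinks, trading some sustained memory usage for
// predictable, allocation-free steady-state performance.
inline std::byte* GetLhsPackedScratch(size_t bytes) {
    if (g_kai_tls.lhs_packed.capacity() < bytes) {
        g_kai_tls.lhs_packed.reserve(bytes);
    }
    g_kai_tls.lhs_packed.resize(bytes);
    return g_kai_tls.lhs_packed.data();
}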

Example performance

All runs were captured using onnxruntime_perf_test
e.g. onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1
Android Platform
[image: Android performance results]

In addition, on the M4 we have also observed slight improvements across models; however, the gain is not as significant, since allocation overhead accounts for a smaller share of total time on that platform.

Mac Mini M4
[image: Mac Mini M4 performance results]

@JonathanC-ARM JonathanC-ARM changed the title Reworked memory allocations in sgemm_kleidi reuse memory buffers Reworked sgemm_kleidi memory allocations to reuse memory buffers Sep 25, 2025
Member

@snnn snnn left a comment


This PR introduces a good performance optimization by reusing memory buffers through thread_local storage. The approach seems sound, especially for reducing allocation overhead on platforms like Android, as highlighted in the description.

The use of std::vector::reserve and std::vector::resize for managing the thread-local buffers is appropriate for this kind of optimization.

One minor question regarding the lhs_packed and rhs_packed buffers:

if (g_kai_tls.lhs_packed.capacity() < LhsPackedStride * BatchSize) {
    g_kai_tls.lhs_packed.reserve(LhsPackedStride * BatchSize);
}
g_kai_tls.lhs_packed.resize(LhsPackedStride * BatchSize);

This logic ensures the capacity grows as needed. Is there any scenario where the LhsPackedStride * BatchSize could become significantly smaller than the current capacity, leading to a large amount of unused reserved memory? While std::vector doesn't shrink its capacity automatically, it's generally fine for performance-critical buffers that tend to grow or stabilize at a certain size. Just confirming this is the intended behavior and not a concern for this specific use case.

LhsPackedData = LhsPacked.get();

std::unique_ptr<std::byte[]> RhsPacked{nullptr};
if (g_kai_tls.lhs_packed.capacity() < LhsPackedStride * BatchSize) {
Member


Is there any measure to avoid integer overflow in LhsPackedStride * BatchSize?

Contributor Author


That is a fair point. I can add some checks before resizing to try to protect against overflow.

Contributor Author


I've added a new function to mlasi_kleidi.h that uses a GCC/Clang builtin to check for size_t wraparound; I also added a generic implementation for systems without the builtin.
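
For illustration, such a helper might look like the following sketch; the exact implementation in mlasi_kleidi.h may differ (the function name matches the call site quoted later in this thread).

#include <cstddef>
#include <limits>

// Sketch of an overflow-checked size_t multiply. Returns true if
// a * b would wrap around size_t; on success writes the product to *out.
inline bool mul_overflow_size_t_builtin(size_t a, size_t b, size_t* out) {
#if defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin: reports overflow and stores the (wrapped) result.
    return __builtin_mul_overflow(a, b, out);
#else
    // Generic fallback: a * b wraps iff a != 0 && b > SIZE_MAX / a.
    if (a != 0 && b > std::numeric_limits<size_t>::max() / a) {
        return true;
    }
    *out = a * b;
    return false;
#endif
}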

@JonathanC-ARM
Contributor Author

> One minor question regarding the lhs_packed and rhs_packed buffers: is there any scenario where LhsPackedStride * BatchSize could become significantly smaller than the current capacity, leading to a large amount of unused reserved memory? [quoted from the review comment above]

Yes, this is the intended behavior. You are correct that if the required buffer size later becomes smaller, the capacity may remain larger than strictly necessary, leading to some unused reserved memory. In practice, we expect the footprint to plateau based on the model/matrices under test, and the maximum allocation would be the same with or without this approach.

We deliberately avoid shrinking the vectors, since at this level we cannot predict future operations, and frequent shrink/grow cycles could be costly. The trade-off is slightly higher sustained memory usage, but with more predictable performance.

std::vector<std::byte> rhs_packed;
std::vector<std::byte> lhs_packed;
};
static thread_local KaiTlsBuffers g_kai_tls;
Member


Please note that ORT, as a library, cannot control when a thread_local variable is deallocated. Since the destructor of this variable does not depend on anything except the C++ runtime, I feel it's fine.

@snnn snnn closed this Sep 26, 2025
@snnn snnn reopened this Sep 26, 2025
if (mul_overflow_size_t_builtin(TileSizeM, TileSizeN, &tile_elems))
{
// size_t wraparound detected, exit
return
Contributor


missing semi-colon?

Also, this will just return early from the worker thread and not actually exit the function. Perhaps this check should be moved before the MlasTrySimpleParallel call, using the maximum possible values of TileSizeM and TileSizeN.

Contributor Author


Apologies, that semi-colon was a typo on my part.

I have moved that logic outside the MlasTrySimpleParallel call so we can return early based on the maximum size.
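
Roughly, the hoisted check would look like the sketch below; MaxTileSizeM and MaxTileSizeN are hypothetical names for the worst-case tile dimensions, and ThreadPool/TileCount stand in for the surrounding function's arguments.

// Validate the worst-case tile size once, before dispatching work, so a
// wraparound fails the whole operation instead of silently returning
// from a single worker thread.
size_t max_tile_elems = 0;
if (mul_overflow_size_t_builtin(MaxTileSizeM, MaxTileSizeN, &max_tile_elems)) {
    return false;  // size_t wraparound detected; bail out before parallel dispatch
}
MlasTrySimpleParallel(ThreadPool, TileCount, [&](ptrdiff_t tid) {
    // Workers can now assume tile-size arithmetic cannot overflow.
    // ... per-tile packing and GEMM work ...
});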

@JonathanC-ARM JonathanC-ARM requested review from edgchen1 and snnn October 1, 2025 09:20