
Conversation

JonathanC-ARM
Contributor

Key changes

This PR makes changes to KleidiAI integration within the existing sgemm_kleidiai.cpp implementation.

During internal testing it was noted that memory allocation overhead, caused by repeatedly allocating vectors, was having a negative impact on performance.

The changes introduce thread local buffers for reusing memory during inference.

Android platforms are particularly sensitive to this; we have observed inference times being significantly impacted by memory allocation overhead.
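
For context, the reuse pattern looks roughly like the sketch below, assembled from the snippets quoted later in this thread. KaiTlsBuffers and g_kai_tls appear in the diff; the GetLhsPackedScratch helper is illustrative only.

#include <cstddef>
#include <vector>

// Per-thread scratch buffers, reused across inference calls so that
// steady-state GEMMs perform no heap allocations.
struct KaiTlsBuffers {
    std::vector<std::byte> rhs_packed;
    std::vector<std::byte> lhs_packed;
};
static thread_local KaiTlsBuffers g_kai_tls;

// Illustrative helper: grow-only resize of the LHS packing buffer.
// Capacity never shrinks, trading some sustained memory usage for
// predictable, allocation-free steady-state performance.
inline std::byte* GetLhsPackedScratch(size_t bytes) {
    if (g_kai_tls.lhs_packed.capacity() < bytes) {
        g_kai_tls.lhs_packed.reserve(bytes);
    }
    g_kai_tls.lhs_packed.resize(bytes);
    return g_kai_tls.lhs_packed.data();
}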

Example performance

All runs were captured using onnxruntime_perf_test
e.g. onnxruntime_perf_test -v -e cpu -I -m times -x 1 -y 1 -r 1
Android Platform
[image: Android performance results]

In addition, on the M4 we have also observed slight improvements across models; however, the gain is not as significant, since allocation overhead accounts for a smaller share of total time on that platform.

Mac Mini M4
[image: Mac Mini M4 performance results]

@JonathanC-ARM JonathanC-ARM changed the title Reworked memory allocations in sgemm_kleidi reuse memory buffers Reworked sgemm_kleidi memory allocations to reuse memory buffers Sep 25, 2025
Member

@snnn snnn left a comment


This PR introduces a good performance optimization by reusing memory buffers through thread_local storage. The approach seems sound, especially for reducing allocation overhead on platforms like Android, as highlighted in the description.

The use of std::vector::reserve and std::vector::resize for managing the thread-local buffers is appropriate for this kind of optimization.

One minor question regarding the lhs_packed and rhs_packed buffers:

if (g_kai_tls.lhs_packed.capacity() < LhsPackedStride * BatchSize) {
    g_kai_tls.lhs_packed.reserve(LhsPackedStride * BatchSize);
}
g_kai_tls.lhs_packed.resize(LhsPackedStride * BatchSize);

This logic ensures the capacity grows as needed. Is there any scenario where the LhsPackedStride * BatchSize could become significantly smaller than the current capacity, leading to a large amount of unused reserved memory? While std::vector doesn't shrink its capacity automatically, it's generally fine for performance-critical buffers that tend to grow or stabilize at a certain size. Just confirming this is the intended behavior and not a concern for this specific use case.

LhsPackedData = LhsPacked.get();

std::unique_ptr<std::byte[]> RhsPacked{nullptr};
if (g_kai_tls.lhs_packed.capacity() < LhsPackedStride * BatchSize) {
Member


Is there any measure to avoid integer overflow in LhsPackedStride * BatchSize?

Contributor Author


That is a fair point. I can add some checks before resizing to try to protect against overflow.

Contributor Author


I've added a new function to mlasi_kleidi.h that uses a GCC/Clang builtin to check for size_t wraparound; I also added a generic implementation for systems without the builtin.
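
For illustration, such a helper might look like the following sketch; the exact implementation in mlasi_kleidi.h may differ (the function name matches the call site quoted later in this thread).

#include <cstddef>
#include <limits>

// Sketch of an overflow-checked size_t multiply. Returns true if
// a * b would wrap around size_t; on success writes the product to *out.
inline bool mul_overflow_size_t_builtin(size_t a, size_t b, size_t* out) {
#if defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin: reports overflow and stores the (wrapped) result.
    return __builtin_mul_overflow(a, b, out);
#else
    // Generic fallback: a * b wraps iff a != 0 && b > SIZE_MAX / a.
    if (a != 0 && b > std::numeric_limits<size_t>::max() / a) {
        return true;
    }
    *out = a * b;
    return false;
#endif
}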

@JonathanC-ARM
Contributor Author

> One minor question regarding the lhs_packed and rhs_packed buffers: is there any scenario where LhsPackedStride * BatchSize could become significantly smaller than the current capacity, leading to a large amount of unused reserved memory? [quoted from the review comment above]

Yes, this is the intended behavior. You are correct that if the required buffer size later becomes smaller, the capacity may remain larger than strictly necessary, leading to some unused reserved memory. In practice, we expect the footprint to plateau based on the model/matrices under test, and the maximum allocation would be the same with or without this approach.

We deliberately avoid shrinking the vectors, since at this level we cannot predict future operations, and frequent shrink/grow cycles could be costly. The trade-off is slightly higher sustained memory usage, but with more predictable performance.

std::vector<std::byte> rhs_packed;
std::vector<std::byte> lhs_packed;
};
static thread_local KaiTlsBuffers g_kai_tls;
Member


Please note that ORT, as a library, cannot control when a thread_local variable is deallocated. Since the destructor of this variable does not depend on anything except the C++ runtime, I feel it's fine.

@snnn snnn closed this Sep 26, 2025
@snnn snnn reopened this Sep 26, 2025
if (mul_overflow_size_t_builtin(TileSizeM, TileSizeN, &tile_elems))
{
// size_t wraparound detected, exit
return
Contributor


missing semi-colon?

Also, this will just return early from the worker thread and not actually exit the function. Perhaps this check should be moved before the MlasTrySimpleParallel call, using the maximum possible values of TileSizeM and TileSizeN.

Contributor Author


Apologies, that semi-colon was a typo on my part.

I have moved that logic outside the MlasTrySimpleParallel call so we can return early based on the maximum size.
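
Roughly, the hoisted check would look like the sketch below; MaxTileSizeM and MaxTileSizeN are hypothetical names for the worst-case tile dimensions, and ThreadPool/TileCount stand in for the surrounding function's arguments.

// Validate the worst-case tile size once, before dispatching work, so a
// wraparound fails the whole operation instead of silently returning
// from a single worker thread.
size_t max_tile_elems = 0;
if (mul_overflow_size_t_builtin(MaxTileSizeM, MaxTileSizeN, &max_tile_elems)) {
    return false;  // size_t wraparound detected; bail out before parallel dispatch
}
MlasTrySimpleParallel(ThreadPool, TileCount, [&](ptrdiff_t tid) {
    // Workers can now assume tile-size arithmetic cannot overflow.
    // ... per-tile packing and GEMM work ...
});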

@JonathanC-ARM JonathanC-ARM requested review from edgchen1 and snnn October 1, 2025 09:20