Skip to content

Conversation

jslap-ubi
Copy link

Description

Add compile flags to disable some SIMD optimizations

Motivation and Context

There is a mechanism to detect at runtime if some SIMD optimizations are available, and skip them if it is not the case, But on some platform, some compiler SIMD optimization are not available, even at the compilation step.

@jslap-ubi jslap-ubi requested a review from a team as a code owner April 5, 2024 20:49
@jslap-ubi
Copy link
Author

@microsoft-github-policy-service agree company="Ubisoft"


#if defined(MLAS_TARGET_AMD64)

#if !defined(ORT_DISABLE_AVX2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these macros come from?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to follow the way enabling/disabling certain things was done in the code (for instance: ORT_ENABLE_STREAM). Then the macros should be defined with a '-D CMAKE_CXX_FLAGS=ORT_DISABLE_AVX2'.

Would there be a preferred way of doing this? Should it be done through cmake option() ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be preferable to make them CMake options. They should be documented too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I don't want to overcrowd the cmake options. That would be 5 more options.
If you're positive, I will add the options and documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they should be documented. it's fine to add additional CMake options. these are additional build configuration options so it seems like a reasonable place for them.

@yufenglee
Copy link
Member

The instruction set will be chosen dynamically at runtime. What is the reason to add those macros? Could you please explain more about it?

@jslap-ubi
Copy link
Author

Some SDKs just don't have the include files needed for the intrinsic. So making it optional would allow more platforms to be able to use the library.

@yufenglee
Copy link
Member

Some SDKs just don't have the include files needed for the intrinsic. So making it optional would allow more platforms to be able to use the library.

Which SDKs don't have the intrinsic files?

@jslap-ubi
Copy link
Author

There is one of our gaming platform that links inclusion of the files containing the symbols for the intrinsics (__m512 for instance) with the compiler actually generating code with the instruction set (-mavx512 for instance).
And we would like to have the flexibility to use the compile flag or not.

@jslap-ubi
Copy link
Author

Have you had the chance to review with my last comments?

@jslap-ubi
Copy link
Author

Bump. Just to make sure we don't forget about this one.

@jslap-ubi
Copy link
Author

Any news on this? @yufenglee ?

@yufenglee
Copy link
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@yufenglee
Copy link
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline

@yufenglee
Copy link
Member

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

Copy link

Azure Pipelines successfully started running 9 pipeline(s).

Copy link

Azure Pipelines successfully started running 10 pipeline(s).

yufenglee
yufenglee previously approved these changes Aug 1, 2024
Copy link
Member

@yufenglee yufenglee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@yufenglee
Copy link
Member

/azp run Linux Android Emulator QNN CI Pipeline

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@yufenglee
Copy link
Member

/azp run Python format

Copy link

No pipelines are associated with this pull request.

@yufenglee
Copy link
Member

@jslap-ubi, you need to sync to latest main branch to trigger the Python format CI.

@jslap-ubi jslap-ubi force-pushed the js/allow-disable-SIMD branch from 2f67b8f to 46f8996 Compare August 1, 2024 19:49
@yufenglee
Copy link
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@yufenglee
Copy link
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline

Copy link

Azure Pipelines successfully started running 9 pipeline(s).

Copy link

Azure Pipelines successfully started running 10 pipeline(s).

@yufenglee
Copy link
Member

/azp run Linux Android Emulator QNN CI Pipeline

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jslap-ubi
Copy link
Author

Thanks for the approval @yufenglee , what is the next step for this?

@jslap-ubi
Copy link
Author

Is there anything else I need to do to go forward with the merge? @yufenglee

@jslap-ubi
Copy link
Author

I just rebase on main since there was now a file conflicting.

@yufenglee : I think we need to renew the approval. Also, What would then be missing for having this MR merged?

@jslap-ubi jslap-ubi force-pushed the js/allow-disable-SIMD branch from d0aada7 to 7ca7306 Compare November 29, 2024 21:54
@jslap-ubi
Copy link
Author

The MR has been rebased, it would be ready to merge.

@yufenglee @yihonglyu : Could it be possible to re-approve this?


#if defined(MLAS_TARGET_AMD64)

#if !defined(ORT_DISABLE_AVX2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be preferable to make them CMake options. They should be documented too.

}

#ifndef __APPLE__
#if !defined(__APPLE__) && !defined(ORT_DISABLE_AMX)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This !defined(ORT_DISABLE_AMX) block also contains this AVX kernel assignment.

https://github.com/jslap-ubi/onnxruntime/blob/7ca7306ae5b65a9606e57ca98d730ae438f568d1/onnxruntime/core/mlas/lib/platform.cpp#L510-L512

Why is this guarded by the ORT_DISABLE_AMX macro?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I think it happened when I rebased. I will fix it.

@edgchen1
Copy link
Contributor

There is a mechanism to detect at runtime if some SIMD optimizations are available, and skip them if it is not the case, But on some platform, some compiler SIMD optimization are not available, even at the compilation step.

Do you want to skip compilation of certain optimized kernels? It looks like this change only prevents them from being selected at runtime.

Would you mind elaborating on your use case?

@jslap-ubi
Copy link
Author

Would you mind elaborating on your use case?

On some gaming platform, compiling the kernels that target those architecture will just fail bacause the intrinsics are not present at all. So we need to skip the compilation of these specific kernel, and to skip where these function pointers are used.

@edgchen1
Copy link
Contributor

Would you mind elaborating on your use case?

On some gaming platform, compiling the kernels that target those architecture will just fail bacause the intrinsics are not present at all. So we need to skip the compilation of these specific kernel, and to skip where these function pointers are used.

I see. Does this change work as-is? I would expect that you'd also need to disable the actual compilation of the kernel source files elsewhere, like in onnxruntime_mlas.cmake.

@jslap-ubi
Copy link
Author

I see. Does this change work as-is? I would expect that you'd also need to disable the actual compilation of the kernel source files elsewhere, like in onnxruntime_mlas.cmake.

Yes, the actual kernel is located in the file qgemm_kernel_amx.cpp. (we do exclude this file from the build on our side), So I think the exclusion of the files should be integrated in the cmake files, right?

@edgchen1
Copy link
Contributor

So I think the exclusion of the files should be integrated in the cmake files, right?

Yes, that seems reasonable.

@jslap-ubi jslap-ubi force-pushed the js/allow-disable-SIMD branch from 7908f8b to 4568d96 Compare September 15, 2025 20:14
@jslap-ubi
Copy link
Author

I did the fixes in the cmake files, and I rebased.

set_source_files_properties(${mlas_platform_srcs_sse2} PROPERTIES COMPILE_FLAGS "-msse2")

set(mlas_platform_srcs_sse41
${MLAS_SRC_DIR}/qgemm_kernel_sse41.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we adding this file here?

FYI, there's a comment saying it's only for Windows:

// N.B. MSVC does not require turning on SSE 4.1 intrinsics and the current use
// for this code is Windows only, so restrict this kernel to that environment.

and also this ifdef checking for _MSC_VER, which as far as I know implies Windows:

#if defined(_MSC_VER)
//
// Check if the processor supports SSE 4.1 instructions.
//
#ifndef FORCE_GENERIC_ALGORITHMS
if ((Cpuid1[2] & 0x80000) != 0) {
#else // FORCE_GENERIC_ALGORITHMS
if (false) {
#endif // FORCE_GENERIC_ALGORITHMS
this->GemmU8S8Dispatch = &MlasGemmU8S8DispatchSse41;
}
#endif

Copy link

@cgaudreau-ubisoft cgaudreau-ubisoft Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is necessary for compiling with clang on windows. clang on windows defines _MSC_VER but won't have the necessary sse41 symbols for the gemm op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see, I didn't realize you all were using Clang on Windows. I'm not sure how much we've tested with that.

one question - would a check with check_cxx_compiler_flag (e.g., check_cxx_compiler_flag("-msse4.1" HAS_INTEL_SSE_4_1)) be sufficient to determine whether the file can be compiled in your particular case? that could be a more automatic way to exclude files requiring specific instruction support which may be unavailable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this particular change would also enable compilation of qgemm_kernel_sse41.cpp on non-Windows platforms. that may be fine, but it would be good to verify that first. for this PR, it may be simpler to only include qgemm_kernel_sse41.cpp in setup_mlas_source_for_windows().

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the compiler flag might work, but I agree it's probably simpler for this PR just to limit the inclusion to the setup for windows -- that's what I've done in the most recent version.

)

if (NOT onnxruntime_ORT_MINIMAL_BUILD)
if (NOT onnxruntime_DISABLE_SSE4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the only place which would need to be modified. see also setup_mlas_source_for_windows():

function(setup_mlas_source_for_windows)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've updated all locations where these checks need to be done.

endif()

if (onnxruntime_DISABLE_SSE4)
add_compile_definitions(DISABLE_SSE4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we limit the definition's scope to the onnxruntime_mlas target?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

endif()

if (onnxruntime_DISABLE_AVX)
add_compile_definitions(DISABLE_AVX)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these defines don't match up with the macros in the code. e.g., ORT_DISABLE_AVX. please verify that they work as expected. we don't have CI builds which test these options.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Sorry. I tested it this morning. Will fix it.

@cgaudreau-ubisoft
Copy link

Hi! Just a heads up, I will be taking over for @jslap-ubi on the PR. I'm looking into your comments and getting caught up, I hope to address them early next week.

Comment on lines 8 to 15
# Usually, SIMD instructions in kernel compile fine, and we detect at run-time if they are supported
# But on some platforms, even compiling these SIMD specific instructions could fail.
# So we provide options to disable compiling these SIMD instructions.
option(onnxruntime_DISABLE_SSE4 "Disable compiling kernel with SSE4 instructions" OFF)
option(onnxruntime_DISABLE_AVX "Disable compiling kernel with AVX instructions" OFF)
option(onnxruntime_DISABLE_AVX2 "Disable compiling kernel with AVX2 instructions" OFF)
option(onnxruntime_DISABLE_AVX512 "Disable compiling kernel with AVX512 instructions" OFF)
option(onnxruntime_DISABLE_AMX "Disable compiling kernel with AMX instructions" OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we keep these option definitions next to the others in cmake/CMakeLists.txt?

option(onnxruntime_DISABLE_AMX "Disable compiling kernel with AMX instructions" OFF)

if (onnxruntime_DISABLE_SSE4)
add_compile_definitions(ORT_DISABLE_SSE4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please limit the definitions to the onnxruntime_mlas target, e.g., with target_compile_definitions(onnxruntime_mlas PRIVATE ...)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


if (NOT onnxruntime_DISABLE_SSE4)
target_sources(onnxruntime_mlas PRIVATE
${mlas_platform_srcs_sse41}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will mlas_platform_srcs_sse41 have been set to anything here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it wouldn't have been set to anything. Fixed.

set_source_files_properties(${mlas_platform_srcs_sse2} PROPERTIES COMPILE_FLAGS "-msse2")

set(mlas_platform_srcs_sse41
${MLAS_SRC_DIR}/qgemm_kernel_sse41.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see, I didn't realize you all were using Clang on Windows. I'm not sure how much we've tested with that.

one question - would a check with check_cxx_compiler_flag (e.g., check_cxx_compiler_flag("-msse4.1" HAS_INTEL_SSE_4_1)) be sufficient to determine whether the file can be compiled in your particular case? that could be a more automatic way to exclude files requiring specific instruction support which may be unavailable.

set_source_files_properties(${mlas_platform_srcs_sse2} PROPERTIES COMPILE_FLAGS "-msse2")

set(mlas_platform_srcs_sse41
${MLAS_SRC_DIR}/qgemm_kernel_sse41.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this particular change would also enable compilation of qgemm_kernel_sse41.cpp on non-Windows platforms. that may be fine, but it would be good to verify that first. for this PR, it may be simpler to only include qgemm_kernel_sse41.cpp in setup_mlas_source_for_windows().

Comment on lines 546 to 549
#endif // !defined(ORT_DISABLE_AVX2)

#endif // MLAS_TARGET_AMD64
#endif // !defined(ORT_DISABLE_AVX)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently there is a hierarchy of the code disabled by these macros. e.g., ORT_DISABLE_AVX -> ORT_DISABLE_AVX2 -> ORT_DISABLE_AVX512, where if ORT_DISABLE_AVX is defined, the code controlled by ORT_DISABLE_AVX512 will also be disabled.

this hierarchy is not reflected in the CMake logic that specifies what files are compiled. could it be made consistent?

e.g., flatten the ifdefs here so that they only surround the assignments of kernels that require the corresponding instruction support.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@edgchen1 edgchen1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you elaborate a bit more on your use case(s)? what compilers are you using and what errors do you encounter?

I wonder if we can improve our automatic configuration instead of needing to add options to disable things manually. but it's not yet obvious to me what we're trying to fix here.

set(mlas_platform_srcs_sse41
${MLAS_SRC_DIR}/qgemm_kernel_sse41.cpp
)
set_source_files_properties(${mlas_platform_srcs_sse41} PROPERTIES COMPILE_FLAGS "-msse4.1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-msse4.1 won't work for MSVC, right? that is the compiler for which setup_mlas_source_for_windows() will be called.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this, you're right this doesn't make sense for MSVC.


if (((Cpuid1[2] & 0x1000) != 0) && ((Cpuid7[1] & 0x20) != 0)) {

#if !defined(ORT_DISABLE_AVX2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ifdef block also includes FMA3 kernels. should those also be controlled by the macro to disable AVX2?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I've re-grouped the FMA3 kernels together and moved them outside the block.

@cgaudreau-ubisoft
Copy link

cgaudreau-ubisoft commented Oct 1, 2025

#20215 (comment)

encounter

I apologies for the lack of clarity (and delay, I was on holiday). We have a case where we need to compile the project using clang on windows with only SSE2 instructions and not SSE4 (this is to support older machines).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants