Describe the bug
Hi team,
We received an end-to-end performance issue report from Llama 3.1 users: they observed a performance drop when using the Inductor C++ wrapper (AOTInductor) compared to the Python wrapper.
The root cause is that, in the C++ wrapper, Inductor must launch the kernel directly (Triton is not required in AOTInductor deploy mode). Launching the SPIR-V kernel compiled by Triton requires passing a `build_flag` to the Level Zero API `zeModuleCreate` to indicate whether large GRF mode is enabled. However, Inductor currently has no visibility into this flag, which is determined by Triton. As a result, Inductor does not pass the correct `build_flag` to Level Zero, and the resulting binary kernel differs from the one Triton would build.
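For context, here is a minimal sketch of how the flag is tied to the GRF mode. The flag strings follow Intel's Level Zero/IGC documentation; the `grf_mode` option name is an assumption for illustration, not confirmed Triton API:

```python
# A sketch, not Triton's actual code: map a GRF-mode compile option to the
# Level Zero build flags passed to zeModuleCreate. "grf_mode" is an assumed
# option name; the flag strings follow Intel's Level Zero/IGC documentation.
def level_zero_build_flags(grf_mode: str) -> str:
    if grf_mode == "large":
        # Documented option enabling large GRF (register file) mode.
        return "-ze-opt-large-register-file"
    if grf_mode == "auto":
        # Let the compiler choose large GRF per kernel.
        return "-ze-intel-enable-auto-large-GRF-mode"
    return ""  # default: no extra flag

# Conceptually, the Python-wrapper launch path then does:
#   zeModuleCreate(ctx, dev, desc_with(pBuildFlags=flags), &module, &log)
# The C++ wrapper must pass the same flags, or Level Zero builds a
# different binary kernel from the same SPIR-V.
```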
To address this, I suggest that Triton store the `build_flag` in the `metadata` of the `CompiledKernel` object returned by `triton.compile()`. This way, Inductor can retrieve and propagate the correct flag; a sketch of both sides follows below.
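To make the request concrete, here is a hedged sketch of the two sides; the `build_flags` metadata field is hypothetical and named only for illustration:

```python
# Hypothetical sketch; "build_flags" is an illustrative metadata field name,
# not an existing Triton attribute.

# 1) Triton side: when the XPU backend assembles the dict that backs
#    CompiledKernel.metadata, record the Level Zero flags it compiled with
#    (e.g. the string returned by level_zero_build_flags() above).
def record_build_flags(metadata: dict, flags: str) -> None:
    metadata["build_flags"] = flags

# 2) Inductor side: read the flag from the CompiledKernel returned by
#    triton.compile() and emit it into the generated C++ wrapper, so that
#    the wrapper's zeModuleCreate call matches Triton's own build.
def wrapper_build_flags(compiled) -> str:
    # Default to "" so kernels without the field keep current behavior.
    return getattr(compiled.metadata, "build_flags", "")
```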
This is a critical performance issue for AOTInductor users and others relying on the C++ wrapper to reduce host overhead. We would greatly appreciate it if this feature request could be included in PyTorch 2.10.
Thanks
Environment details
None