
Conversation

@tqchen tqchen commented May 7, 2025

This PR introduces the AutoDLPack feature to the TVM FFI. When an FFI Function takes a Tensor argument that conforms to the DLPack protocol, the argument is automatically imported into an NDArray and passed through.

The feature allows a compiled function to take torch.Tensor directly as an input argument without an extra set of conversions.

We also added a benchmark script to measure the overall FFI overhead. One thing to note is that the underlying DSL compiler still imposes contiguity and alignment requirements, and as of now we use a global value for them. So x.contiguous() is still needed before passing the argument if transpose or other layout-changing ops were performed.
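To illustrate the calling convention, here is a minimal sketch. `my_add` is a hypothetical function name standing in for any function registered through the TVM FFI; it is not something this PR defines:

```python
# Minimal sketch of what AutoDLPack enables. "my_add" is a
# hypothetical registered function used only for illustration.
import torch
import tvm

f = tvm.get_global_func("my_add")

x = torch.randn(8, 8)
y = torch.randn(8, 8)
out = torch.empty(8, 8)

# DLPack-compatible tensors are now imported into NDArray
# automatically, so they can be passed to f as-is:
f(x, y, out)

# Views such as transposes may be non-contiguous. The underlying DSL
# compiler still assumes contiguous, aligned buffers, so make the
# argument contiguous first:
f(x.t().contiguous(), y, out)
```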

@tqchen tqchen changed the title [FFI][FEAT] AutoDLPack for taking external tensor objects. [FFI][FEAT] AutoDLPack for taking external tensor objects May 7, 2025
@tqchen tqchen commented May 7, 2025

Benchmark

Env CPU: AMD Ryzen 9 7950X

> python ffi/scripts/benchmark_dlpack.py

-----------------------------
Benchmark f(x, y, z) overhead
-----------------------------
numpy.add                                1.921653747558594e-07 sec/call
torch.add[cpu]                           6.330013275146484e-07 sec/call
torch.add[cuda]                          2.330756187438965e-06 sec/call
tvm.ffi.nop                              3.983736038208008e-07 sec/call
tvm.ffi.nop+from_dlpack(torch)           4.368019104003906e-06 sec/call
tvm.ffi.nop+from_dlpack(numpy)           1.1694192886352538e-06 sec/call
tvm.ffi.nop+from_dlpack(tvm)             1.4580249786376954e-06 sec/call
tvm.ffi.nop+from_dlpack(torch.utils)     3.2754182815551756e-06 sec/call
tvm.ffi.nop.autodlpack(torch[cpu])       3.567361831665039e-06 sec/call
tvm.ffi.nop.autodlpack(torch[cuda])      3.5606861114501952e-06 sec/call
tvm.ffi.nop.autodlpack(numpy)            1.6696929931640624e-06 sec/call
-------------------------------
Benchmark x.__dlpack__ overhead
-------------------------------
torch.utils.dlpack.to_dlpack             4.5762062072753906e-07 sec/call
torch.__dlpack__                         9.840965270996094e-07 sec/call
numpy.__dlpack__                         5.011558532714844e-08 sec/call
tvm.__dlpack__                           1.5852451324462892e-07 sec/call
---------------------------------------------------
Benchmark x.__dlpack__(max_version=(1,1)) overhead
---------------------------------------------------
torch.__dlpack__(max_version=(1,1))      Tensor.__dlpack__() got an unexpected keyword 'max_version'
numpy.__dlpack__(max_version=(1,1))      6.172657012939454e-08 sec/call
tvm.__dlpack__(max_version=(1,1))        1.720428466796875e-07 sec/call

Discussions

  • First, we can see that the overall Python/C++ FFI overhead is roughly at the 0.2us to 3us level.
    • Notably, each eager torch.add call on CUDA is around 2.4us.
  • AutoDLPack as of now gets to about 3.6us for a call f(x, y, z) that needs three import calls, which aligns reasonably well with the torch eager CUDA overhead.
  • One can observe that the torch.__dlpack__ overhead is larger than that of numpy.__dlpack__.
    • torch.__dlpack__ could use some improvement; tvm.__dlpack__ is backed by a C++ implementation and gives a rough estimate of what is achievable. A timeit sketch of this measurement follows the list.
  • AutoDLPack from numpy arguments has about 1.7us of overhead.
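
For reference, the per-call __dlpack__ numbers can be approximated with a plain timeit loop along these lines (a sketch, not the actual ffi/scripts/benchmark_dlpack.py; assumes numpy >= 1.23, which added ndarray.__dlpack__):

```python
# Rough reproduction of the x.__dlpack__ timings above.
import timeit

import numpy as np
import torch
import torch.utils.dlpack

x_np = np.zeros(16)
x_th = torch.zeros(16)

n = 100_000
for name, fn in [
    ("torch.utils.dlpack.to_dlpack", lambda: torch.utils.dlpack.to_dlpack(x_th)),
    ("torch.__dlpack__", lambda: x_th.__dlpack__()),
    ("numpy.__dlpack__", lambda: x_np.__dlpack__()),
]:
    # Each call produces a fresh DLPack capsule; we report sec/call.
    print(f"{name:40s} {timeit.timeit(fn, number=n) / n} sec/call")
```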

This PR introduces the AutoDLPack feature to the TVM FFI. When an FFI Function takes a Tensor argument that conforms to the DLPack protocol, the argument is automatically imported into an NDArray and passed through.

The feature allows a compiled function to take torch.Tensor directly as an input argument without an extra set of conversions. When a function returns an NDArray, the return value still needs to be converted back via torch.from_dlpack.

However, a common use case is destination passing, where all inputs and outputs are pre-allocated and passed into the function. In that pattern AutoDLPack effectively enables zero-overhead support for a wide range of Python arrays.

We also added a benchmark script to measure the overall FFI overhead. One thing to note is that the underlying DSL compiler still imposes contiguity and alignment requirements, and as of now we use a global value for them. So x.contiguous() is still needed before passing the argument if transpose or other layout-changing ops were performed.
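
As a sketch of that destination-passing pattern (`gemm` is a hypothetical registered function used only for illustration):

```python
# Destination passing: inputs and the output buffer are pre-allocated
# and all passed into the call. "gemm" is a hypothetical FFI function.
import torch
import tvm

gemm = tvm.get_global_func("gemm")

a = torch.randn(128, 128)
b = torch.randn(128, 128)
c = torch.empty(128, 128)  # pre-allocated output

# AutoDLPack imports each tensor argument, so no conversion code is
# needed at the call site.
gemm(a, b, c)

# If a function instead returns an NDArray, converting back to torch
# still takes an explicit step:
#   y = torch.from_dlpack(ret)
```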

@Hzfengsy Hzfengsy merged commit da6d510 into apache:main May 8, 2025
13 checks passed
ShiboXing pushed a commit to ShiboXing/tvm that referenced this pull request Aug 10, 2025
[FFI][FEAT] AutoDLPack to enable external tensor args.

tqchen added a commit to tqchen/tvm that referenced this pull request Sep 13, 2025
[FFI][FEAT] AutoDLPack to enable external tensor args.

tqchen added a commit to tqchen/tvm that referenced this pull request Sep 13, 2025
[FFI][FEAT] AutoDLPack to enable external tensor args.

tqchen added a commit to tqchen/tvm that referenced this pull request Sep 13, 2025
[FFI][FEAT] AutoDLPack to enable external tensor args.
