
Conversation

@jiafuzha commented Sep 18, 2025

Supported the following fbgemm ops on XPU (a minimal usage sketch follows the list):

fbgemm::asynchronous_complete_cumsum
fbgemm::jagged_to_padded_dense_forward
fbgemm::jagged_to_padded_dense
fbgemm::dense_to_jagged_forward
fbgemm::jagged_dense_elementwise_add_jagged_output
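
For quick reference, here is a minimal usage sketch. It assumes a PyTorch build that includes these XPU ops and an importable fbgemm_gpu package to register the op schemas; the tensor shapes and values are illustrative only.

import torch
import fbgemm_gpu  # noqa: F401  # registers the fbgemm op schemas

# Per-row lengths of a jagged batch, placed on the XPU device.
lengths = torch.tensor([3, 1, 4], dtype=torch.int64, device="xpu")

# A complete cumsum prepends a zero, so the output has len(lengths) + 1 elements.
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets.cpu())  # expected: tensor([0, 3, 4, 8])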

Please make sure the environment variables below are set correctly before running the UT.

# Make sure ONEAPI_ROOT is set since it's referenced in umf's vars.sh. Otherwise, you may not be able to see any device.
export ONEAPI_ROOT=.../intel/oneapi/
# DPCPP 2025.3
source .../DPCPP/env/vars.sh
source ~/intel/oneapi/mkl/latest/env/vars.sh
source .../pti_0.12/env/vars.sh
source .../umf/1.0.2/env/vars.sh
export BUILD_SEPARATE_OPS=ON
export BUILD_WITH_CPU=ON
export TORCH_XPU_ARCH_LIST='pvc'
export USE_PTI=ON
export USE_KINETO=ON
export USE_XETLA=OFF

@jiafuzha changed the title to "fbgemm async complete cumsum op, jagged and dense conversion ops, jagged_dense_elementwise_add_jagged_output op" on Sep 22, 2025
@jiafuzha changed the title to "fbgemm async complete cumsum op, jagged and dense conversion ops, jagged_dense_elementwise_add_jagged_output, reorder batched lengths and indices op" on Sep 23, 2025
@jiafuzha changed the title to "fbgemm async complete cumsum op, jagged and dense conversion ops, jagged_dense_elementwise_add_jagged_output, reorder ops and permute_2d_sparse_data op" on Sep 29, 2025
@jiafuzha (Author) commented:

@majing921201 @fengyuan14 @gujinghui, all the necessary fbgemm ops are now supported on XPU. Please help review.

const OptionalDeviceGuard device_guard(device_of(TENSOR));

Tensor asynchronous_complete_cumsum_xpu(const Tensor& t_in) {
TORCH_CHECK(t_in.is_contiguous());

Contributor:

Is the limitation in the XPU implementation aligned with CUDA?

Author:

Which limitation do you refer to?

namespace {

TORCH_LIBRARY(fbgemm, m) {
m.def("asynchronous_complete_cumsum(Tensor t_in) -> Tensor");

Contributor:

Is there a conflict here when FBGEMM defines the same symbol?

Author:

Yes, it could be a problem. I'll verify and figure it out.

Author:

As verified, the fbgemm op schemas cannot be registered here; otherwise, loading the fbgemm library fails. It is also not appropriate to add a fbgemm library dependency here. My solution is to add the schema definitions in test_fbgemm_ops_xpu.py, like below:

lib = torch.library.Library("fbgemm", "DEF")
lib.define("asynchronous_complete_cumsum(Tensor t_in) -> Tensor")
...

Then I can run all the fbgemm op tests successfully.

On the user side, they need to import fbgemm_gpu as normal, and then they can use these XPU ops. Considering the schemas are unlikely to change, it should be fine for us to follow the official fbgemm schemas.
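
For illustration, a minimal self-contained version of that test-side registration could look like the sketch below. Only the two schema strings that appear in this PR are spelled out; the remaining ops would be defined the same way from fbgemm's official schemas.

import torch

# Define the fbgemm op schemas locally so the XPU implementations registered by
# torch-xpu-ops can be dispatched without installing the fbgemm library itself.
lib = torch.library.Library("fbgemm", "DEF")
lib.define("asynchronous_complete_cumsum(Tensor t_in) -> Tensor")
lib.define(
    "permute_2D_sparse_data(Tensor permute, Tensor lengths, Tensor indices, "
    "Tensor? weights=None, int? permuted_lengths_sum=None) -> (Tensor, Tensor, Tensor?)"
)
# ... remaining op schemas defined the same way ...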

"permute_2D_sparse_data(Tensor permute, Tensor lengths, Tensor indices, Tensor? weights=None, int? permuted_lengths_sum=None) -> (Tensor, Tensor, Tensor?)");
}

TORCH_LIBRARY_IMPL(fbgemm, XPU, m) {

Contributor:

Please use one TORCH_LIBRARY_IMPL to contain all impl definitions.

Author:

Will do.

Author:

Done.

@@ -0,0 +1,1122 @@
# Owner(s): ["module: intel"]

Contributor:

Why not take the CPU result as the reference, like PyTorch UTs do?

Author:

Then we would need to install fbgemm CPU, which is not desired.

Contributor:

If you remove the schema definitions as @fengyuan14 suggested, then you must install fbgemm.

Author:

Correct. But it also means our torch-xpu-ops repo would have a dependency on fbgemm. Does that sound good?

Author:

See my reply to yuanfeng's comments.
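
For illustration, one way a test could build a reference without installing fbgemm CPU is to compute it with plain torch ops. The hypothetical check below for asynchronous_complete_cumsum is only a sketch, not necessarily how the actual UT is written.

import torch

def ref_complete_cumsum(lengths: torch.Tensor) -> torch.Tensor:
    # CPU reference built from plain torch ops: a complete cumsum prepends a
    # zero, so the result has one more element than the input.
    lengths_cpu = lengths.cpu()
    zero = torch.zeros(1, dtype=lengths_cpu.dtype)
    return torch.cat([zero, torch.cumsum(lengths_cpu, dim=0)])

lengths = torch.tensor([3, 1, 4], dtype=torch.int64, device="xpu")
out = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
# check_dtype=False keeps the comparison value-only in case the op promotes dtypes.
torch.testing.assert_close(out.cpu(), ref_complete_cumsum(lengths), check_dtype=False)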
