TianDi101
Collaborator

No description provided.

Collaborator Author

TianDi101 commented Sep 25, 2025

This PR introduces the MORI-EP V1 dispatch kernel. Early results show a significant performance improvement. The numbers below were measured with the DeepSeekV3 config:

FP8 Dispatch: RDMA best 39 GB/s, avg 35.8 GB/s; XGMI best 127 GB/s, avg 118 GB/s
BF16 Dispatch: RDMA best 43.5 GB/s, avg 40.5 GB/s; XGMI best 140 GB/s, avg 133 GB/s

RDMA bandwidth is calculated the same way as in DeepEP.

FP8 Dispatch
image

BF16 Dispatch
image

Collaborator Author

TianDi101 commented Sep 26, 2025

Performance improves further with the channel design.

FP8 Dispatch: RDMA best 56 GB/s, avg 49 GB/s; XGMI best 183 GB/s, avg 164 GB/s
BF16 Dispatch: RDMA best 65 GB/s, avg 56 GB/s; XGMI best 212 GB/s, avg 194 GB/s

FP8 Dispatch
image

BF16 Dispatch
image
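
For a quick comparison of the two runs, the reported numbers can be turned into speedup ratios. This is only a sketch of the arithmetic: the bandwidth values are copied from the two comments in this thread, while the dictionary layout and the `speedups` helper are hypothetical names used for illustration and are not part of MORI-EP.

```python
# Reported dispatch bandwidths in GB/s, copied from the comments in this PR.
# Keys are (precision, link, statistic); the layout is an illustrative choice.
initial = {
    ("FP8", "RDMA", "best"): 39.0,
    ("FP8", "RDMA", "avg"): 35.8,
    ("FP8", "XGMI", "best"): 127.0,
    ("FP8", "XGMI", "avg"): 118.0,
    ("BF16", "RDMA", "best"): 43.5,
    ("BF16", "RDMA", "avg"): 40.5,
    ("BF16", "XGMI", "best"): 140.0,
    ("BF16", "XGMI", "avg"): 133.0,
}

channel = {
    ("FP8", "RDMA", "best"): 56.0,
    ("FP8", "RDMA", "avg"): 49.0,
    ("FP8", "XGMI", "best"): 183.0,
    ("FP8", "XGMI", "avg"): 164.0,
    ("BF16", "RDMA", "best"): 65.0,
    ("BF16", "RDMA", "avg"): 56.0,
    ("BF16", "XGMI", "best"): 212.0,
    ("BF16", "XGMI", "avg"): 194.0,
}


def speedups(before, after):
    """Return after/before for every key present in `before` (hypothetical helper)."""
    return {key: after[key] / before[key] for key in before}


if __name__ == "__main__":
    # Print one line per measurement, e.g. "FP8  RDMA best  39.0 -> 56.0 GB/s".
    for (precision, link, stat), ratio in speedups(initial, channel).items():
        old = initial[(precision, link, stat)]
        new = channel[(precision, link, stat)]
        print(f"{precision:<4} {link:<4} {stat:<4} {old:6.1f} -> {new:6.1f} GB/s  ({ratio:.2f}x)")
```

On these numbers the channel design gives roughly 1.37x to 1.51x over the first run, with the largest gain on BF16 XGMI best.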
