
Conversation

@UNIDY2002 commented Sep 14, 2025

Motivation

Why do we need a new backend?

The Mooncake Backend is a collective communication backend that provides fault tolerance while remaining seamlessly compatible with PyTorch. To the best of our knowledge, this is the first backend that supports fault tolerance for both:

  • torch.distributed primitives, and
  • EP all-to-all primitives.

This lays the foundation for fault-tolerant distributed MoE inference (#8961).

Modifications

The Mooncake Backend and Mooncake EP can be enabled via two options:

Mooncake Backend

  • Set --dist-backend mooncake.
  • This replaces NCCL, Gloo, etc. used by torch.distributed with Mooncake.
  • Implemented by registering "mooncake" as a custom backend for torch.distributed.init_process_group (see the sketch after this list).
  • Minimal code changes are required (non-intrusive integration).
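
For reference, registration through PyTorch's third-party-backend hook looks roughly like the sketch below. The creator function name, its body, and the exact signature are placeholders for the non-extended registration path, not the actual Mooncake code.

import torch.distributed as dist

def _mooncake_pg_creator(store, rank, world_size, timeout):
    # Placeholder: the real implementation constructs and returns
    # Mooncake's ProcessGroup object here.
    ...

# Register "mooncake" so that init_process_group can resolve it by name.
dist.Backend.register_backend("mooncake", _mooncake_pg_creator)

# Call sites then only need to select the backend by name:
# dist.init_process_group(backend="mooncake", rank=rank, world_size=world_size)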

Mooncake EP

  • Set --moe-a2a-backend mooncake.
  • This replaces the DeepEP all-to-all backend with Mooncake EP.
  • Implemented by adding a new class MooncakeEPDispatcher in token_dispatcher/mooncake.py (a simplified skeleton is sketched below).
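
As a rough illustration only, a token dispatcher exposes dispatch/combine hooks that the MoE layer calls around expert computation. The class name, signatures, and members below are simplified assumptions, not the merged code, which additionally handles broken-rank bookkeeping, CUDA graphs, and quantized payloads.

import torch
import torch.distributed as dist

class MooncakeEPDispatcherSketch:
    def __init__(self, group: dist.ProcessGroup, num_experts: int):
        self.group = group
        self.num_experts = num_experts

    def dispatch(self, hidden_states: torch.Tensor, topk_ids: torch.Tensor):
        # All-to-all: route each token to the ranks owning its top-k experts.
        ...

    def combine(self, expert_output: torch.Tensor) -> torch.Tensor:
        # Reverse all-to-all: return expert outputs to the tokens' home ranks.
        ...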

Accuracy Tests

  • Added a new unittest: ep/test_mooncake_ep_small.py to validate correctness of the new features.
  • All existing unittests remain unaffected and should continue to pass.

Caveat:

The Mooncake package with the new backend support will be published soon.

For now, you can download the wheel whose filename contains "+ep" from the CI artifacts of kvcache-ai/Mooncake#805.

Benchmarking and Profiling

According to our unit tests, enabling the Mooncake Backend and Mooncake EP may slightly decrease performance. This is expected, since the Mooncake Backend is not yet fully optimized; optimization work will be carried out separately.

We compared test_mooncake_ep_small.py with test_deepep_small.py.

Throughput (tokens/s):

TestCase          test_mooncake_ep_small.py   test_deepep_small.py
TestPureDP        1581.135                    2011.321
TestHybridDPTP    1324.707                    1773.038
TestTP            1484.145                    1668.714

Latency (s):

TestCase          test_mooncake_ep_small.py   test_deepep_small.py
TestPureDP        11.561                      9.063
TestHybridDPTP    13.639                      10.206
TestTP            12.122                      10.523
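
Derived from the two tables above: Mooncake EP currently reaches roughly 75-89% of DeepEP's throughput (e.g., 1581.135 / 2011.321 ≈ 0.79 for TestPureDP) and shows roughly 15-34% higher latency, consistent with the note that optimization work is still pending.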

PD Disaggregation Usage

We conducted an end-to-end test of this PR in a prefill-decode (PD) disaggregation scenario on two H20-3e nodes.

Prefill

GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 CCL_IB_GID_INDEX=3 SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --disaggregation-ib-device  "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --model-path /data/models/DeepSeek-R1-0528/ --tp 8 --disaggregation-mode prefill  --host 0.0.0.0 --port 30300 --moe-a2a-backend deepep --deepep-mode normal --disable-radix-cache --max-running-requests 16 --moe-dense-tp-size 1 --chunked-prefill-size 0 --trust-remote-code --watchdog-timeout 1000000  --enable-dp-attention --dp-size 4  --mem-fraction-static 0.8 --show-time-cost  --enable-dp-lm-head --page-size 64 --enable-eplb --eplb-rebalance-num-iterations 10000 --load-balance-method round_robin

Decode

GLOO_SOCKET_IFNAME=eth0 NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8  SGL_ENABLE_JIT_DEEPGEMM=1 SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 MC_CUSTOM_TOPO_JSON=/sgl-workspace/nic_priority_matrix.json MC_WORKERS_PER_CTX=1 python3 -m sglang.launch_server --model-path /data/models/DeepSeek-R1-0528/ --tp 8 --disaggregation-mode decode  --disaggregation-ib-device  "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --host 0.0.0.0 --port 30300  --moe-a2a-backend mooncake   --disable-radix-cache --mem-fraction-static 0.8 --max-running-requests 1024  --moe-dense-tp-size 1 --cuda-graph-bs 32  --watchdog-timeout 1000000  --enable-dp-attention --dp-size 8 --trust-remote-code --show-time-cost --enable-dp-lm-head --nnodes 1 --node-rank 0  --enable-metrics --decode-log-interval 1 --enable-request-time-stats-logging --warmups 10 --log-level debug --log-requests --log-requests-level 3 --prefill-round-robin-balance --dist-backend mooncake

Router

python3 -m sglang_router.launch_router --host 0.0.0.0 --port 8000 --pd-disaggregation --prefill-policy cache_aware --decode-policy round_robin --service-discovery --service-discovery-port 30300 --prefill-selector role=leader component=prefill --decode-selector role=leader component=decode --service-discovery-namespace default --request-timeout-secs 600 --max-concurrent-requests 1024 --model-path /data/models/DeepSeek-R1-0528/

Sample curl request

curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/data/models/DeepSeek-R1-0528", "temperature": 0, "messages": [ {"role": "user", "content": "做一份北京出行攻略"}], "max_tokens": 300}'

{"id":"3f890c4a2b5445c0aabffdab05469787","object":"chat.completion","created":1757923786,"model":"/data/models/DeepSeek-R1-0528","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n嗯,用户想要一份北京出行攻略。这个需求很常见,但需要仔细考虑用户可能的背景和需求深度。用户没有说明旅行时长、预算、同行人员或兴趣偏好,所以攻略需要保持通用性,同时覆盖不同人群的可能需求。\n\n北京作为超大型旅游城市,攻略必须分层设计。首先考虑经典路线,这是大多数首次游客的核心需求。故宫、长城、天安门这些标志性景点必须包含,但要注意提醒用户提前预约——最近几年北京热门景点不预约根本进不去,很多外地游客会忽略这点。\n\n交通部分要特别强调地铁的便利性。北京地面交通太拥堵,地铁虽然拥挤但准时,尤其是1号线、2号线这些经典线路。不过也要提醒避开早晚高峰,不然用户体验会很差。机场快轨和北京南站到市区的方式也得写清楚,这是游客刚落地最需要的实用信息。\n\n美食推荐要平衡名气和本地特色。全聚德虽然被说商业化,但第一次来的人总想尝尝;同时必须加入尹三豆汁、门框卤煮这种地道小吃,但得标注清楚“豆汁慎点”——很多人喝不惯。四季民福烤鸭比全聚德性价比高,这个提示对预算有限的游客很实用。\n\n考虑到不同游客类型,补充了亲子、文化等主题路线。带孩子的家庭肯定需要环球影城和科技馆信息;文艺青年可能更喜欢798和胡同。还想到最近流行的","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":8,"total_tokens":308,"completion_tokens":300,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

Next Steps

This PR establishes the foundation by introducing the Mooncake Backend and Mooncake EP. The following steps will build upon it:

  • Integrate with EPLB: Adapt the EPLB algorithm to avoid mapping experts to faulty ranks, using Mooncake’s fault-awareness.
  • Expert Redistribution: Trigger expert rebalancing when new rank failures occur, ensuring that every expert has at least one healthy replica.
  • Scheduler Enhancements: Make the scheduler fault-aware, so failed requests can be retried or rerouted to healthy ranks automatically.
  • Stability and Performance Testing: Broaden testing coverage with larger TP sizes, prefill-decode disaggregation, and multiple-rank failure scenarios.

These will be addressed in subsequent PRs as outlined in the milestone roadmap in #8961 (comment).

Checklist

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @UNIDY2002, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the foundational groundwork for fault-tolerant distributed Mixture-of-Experts (MoE) inference by integrating the new Mooncake communication backend. It introduces Mooncake as a drop-in replacement for torch.distributed primitives and as an alternative all-to-all backend for MoE layers, designed to provide fault tolerance. While current benchmarks indicate a slight performance decrease, this is expected as optimization work is planned for future iterations. This PR is a crucial first step towards building a more robust and resilient distributed inference system.

Highlights

  • New Communication Backend: Introduced the Mooncake Backend to provide fault tolerance for torch.distributed primitives, offering an alternative to NCCL, Gloo, etc.
  • Mooncake Expert Parallelism (EP): Implemented Mooncake EP as a new all-to-all backend for Mixture-of-Experts (MoE) layers, replacing the existing DeepEP backend with a fault-tolerant solution.
  • Non-Intrusive Integration: The Mooncake Backend integrates seamlessly with torch.distributed.init_process_group by registering 'mooncake' as a custom backend, requiring minimal code changes.
  • New Dispatcher Implementation: A new MooncakeEPDispatcher class has been added, providing the core logic for Mooncake EP's dispatch and combine operations, including handling of broken ranks.
  • Expanded Configuration Options: New command-line arguments --dist-backend mooncake and --moe-a2a-backend mooncake are available to enable the new fault-tolerant features.
  • New Unit Tests: A comprehensive new unit test file (test_mooncake_ep_small.py) has been added to validate the correctness of the Mooncake Backend and Mooncake EP across various distributed configurations (Pure DP, Hybrid DPTP, TP, TBO).

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the Mooncake backend and Mooncake EP to support fault-tolerant distributed inference. The changes are well-structured and integrate the new backend into the existing system. I've added a few comments to improve code maintainability and user experience, such as refining error messages, renaming confusing variables, and suggesting refactoring for duplicated code. The addition of new tests for Mooncake EP is also a great step towards ensuring correctness.

Comment on lines +674 to +678
if self.moe_a2a_backend == "mooncake":
self.ep_size = self.tp_size
logger.warning(
f"Mooncake MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size[{self.tp_size}]."
)
@gemini-code-assist (bot) commented (severity: medium):

This block is very similar to the one for deepep on lines 665-669. Consider refactoring them into a single block to avoid code duplication, for example: if self.moe_a2a_backend in ["deepep", "mooncake"]:.
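
A sketch of the suggested consolidation, assuming the deepep branch on lines 665-669 has the same shape as the block quoted above (the combined warning message is illustrative):

if self.moe_a2a_backend in ("deepep", "mooncake"):
    self.ep_size = self.tp_size
    logger.warning(
        f"{self.moe_a2a_backend} MoE is enabled. The expert parallel size is "
        f"adjusted to match the tensor parallel size [{self.tp_size}]."
    )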

@whybeyoung (Collaborator):

LGTM

Comment on lines +68 to +70
def get_ep_active_ranks() -> torch.Tensor:
assert _ACTIVE_RANKS is not None, "_ACTIVE_RANKS is not initialized"
return _ACTIVE_RANKS
@ShangmingCai (Collaborator) commented Sep 22, 2025:

I don't see any place that uses the get_ep_active_ranks util, and _ACTIVE_RANKS is never updated after initialization (L156-L159), so why not just use self.active_ranks? Is this left for future usage?

@UNIDY2002 (Author) replied:

Yes. In the following PRs, _ACTIVE_RANKS will be used by the model_runner to know whether a rank has been broken and determine whether to trigger a redistribution of expert weights.
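
A hypothetical consumer, just to illustrate the intended direction; the helper redistribute_experts and the flag layout of the tensor are assumptions, not code from this PR:

active_ranks = get_ep_active_ranks()  # one flag per EP rank; 0/False marks a broken rank (assumed layout)
broken = (active_ranks == 0).nonzero().flatten().tolist()
if broken:
    # A future model_runner could trigger expert-weight redistribution here so
    # that every expert keeps at least one healthy replica.
    redistribute_experts(exclude_ranks=broken)  # hypothetical helper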
