
Conversation

@UNIDY2002 commented Sep 14, 2025

Motivation

Why do we need a new backend?

The Mooncake Backend is a collective communication backend that provides fault tolerance while remaining seamlessly compatible with PyTorch. To the best of our knowledge, this is the first backend that supports fault tolerance for both:

  • torch.distributed primitives, and
  • EP all-to-all primitives.

This lays the foundation for fault-tolerant distributed MoE inference (#8961).

Modifications

The Mooncake Backend and Mooncake EP can be enabled via two options:

Mooncake Backend

  • Set --dist-backend mooncake.
  • This replaces NCCL, Gloo, etc. used by torch.distributed with Mooncake.
  • Implemented by registering "mooncake" as a custom backend for torch.distributed.init_process_group (see the sketch after this list).
  • Minimal code changes are required (non-intrusive integration).
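
For reference, registration through PyTorch's third-party-backend hook looks roughly like the sketch below. The creator function name, its body, and the exact signature are placeholders for the non-extended registration path, not the actual Mooncake code.

import torch.distributed as dist

def _mooncake_pg_creator(store, rank, world_size, timeout):
    # Placeholder: the real implementation constructs and returns
    # Mooncake's ProcessGroup object here.
    ...

# Register "mooncake" so that init_process_group can resolve it by name.
dist.Backend.register_backend("mooncake", _mooncake_pg_creator)

# Call sites then only need to select the backend by name:
# dist.init_process_group(backend="mooncake", rank=rank, world_size=world_size)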

Mooncake EP

  • Set --moe-a2a-backend mooncake.
  • This replaces the DeepEP all-to-all backend with Mooncake EP.
  • Implemented by adding a new class MooncakeEPDispatcher in token_dispatcher/mooncake.py (a simplified skeleton is sketched below).
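
As a rough illustration only, a token dispatcher exposes dispatch/combine hooks that the MoE layer calls around expert computation. The class name, signatures, and members below are simplified assumptions, not the merged code, which additionally handles broken-rank bookkeeping, CUDA graphs, and quantized payloads.

import torch
import torch.distributed as dist

class MooncakeEPDispatcherSketch:
    def __init__(self, group: dist.ProcessGroup, num_experts: int):
        self.group = group
        self.num_experts = num_experts

    def dispatch(self, hidden_states: torch.Tensor, topk_ids: torch.Tensor):
        # All-to-all: route each token to the ranks owning its top-k experts.
        ...

    def combine(self, expert_output: torch.Tensor) -> torch.Tensor:
        # Reverse all-to-all: return expert outputs to the tokens' home ranks.
        ...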

Accuracy Tests

  • Added a new unittest: ep/test_mooncake_ep_small.py to validate correctness of the new features.
  • All existing unittests remain unaffected and should continue to pass.

Caveat:

The Mooncake package with the new backend support will be published soon.

For now, you can download the wheel whose filename contains "+ep" from the CI artifacts of kvcache-ai/Mooncake#805.

Benchmarking and Profiling

According to our unit tests, enabling the Mooncake Backend and Mooncake EP may slightly decrease performance. This is expected, since the Mooncake Backend is not yet fully optimized; optimization work will be carried out separately.

We compared test_mooncake_ep_small.py with test_deepep_small.py.

Throughput (tokens/s):

TestCase          test_mooncake_ep_small.py   test_deepep_small.py
TestPureDP        1581.135                    2011.321
TestHybridDPTP    1324.707                    1773.038
TestTP            1484.145                    1668.714

Latency (s):

TestCase          test_mooncake_ep_small.py   test_deepep_small.py
TestPureDP        11.561                      9.063
TestHybridDPTP    13.639                      10.206
TestTP            12.122                      10.523
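
Derived from the two tables above: Mooncake EP currently reaches roughly 75-89% of DeepEP's throughput (e.g., 1581.135 / 2011.321 ≈ 0.79 for TestPureDP) and shows roughly 15-34% higher latency, consistent with the note that optimization work is still pending.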

PD Disaggregation Usage

We conducted an end-to-end test of this PR in a prefill-decode (PD) disaggregation scenario on two H20-3e nodes.

Prefill

GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 CCL_IB_GID_INDEX=3 SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 SGL_ENABLE_JIT_DEEPGEMM=1 python3 -m sglang.launch_server --disaggregation-ib-device  "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --model-path /data/models/DeepSeek-R1-0528/ --tp 8 --disaggregation-mode prefill  --host 0.0.0.0 --port 30300 --moe-a2a-backend deepep --deepep-mode normal --disable-radix-cache --max-running-requests 16 --moe-dense-tp-size 1 --chunked-prefill-size 0 --trust-remote-code --watchdog-timeout 1000000  --enable-dp-attention --dp-size 4  --mem-fraction-static 0.8 --show-time-cost  --enable-dp-lm-head --page-size 64 --enable-eplb --eplb-rebalance-num-iterations 10000 --load-balance-method round_robin

Decode

GLOO_SOCKET_IFNAME=eth0 NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8  SGL_ENABLE_JIT_DEEPGEMM=1 SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 MC_CUSTOM_TOPO_JSON=/sgl-workspace/nic_priority_matrix.json MC_WORKERS_PER_CTX=1 python3 -m sglang.launch_server --model-path /data/models/DeepSeek-R1-0528/ --tp 8 --disaggregation-mode decode  --disaggregation-ib-device  "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --host 0.0.0.0 --port 30300  --moe-a2a-backend mooncake   --disable-radix-cache --mem-fraction-static 0.8 --max-running-requests 1024  --moe-dense-tp-size 1 --cuda-graph-bs 32  --watchdog-timeout 1000000  --enable-dp-attention --dp-size 8 --trust-remote-code --show-time-cost --enable-dp-lm-head --nnodes 1 --node-rank 0  --enable-metrics --decode-log-interval 1 --enable-request-time-stats-logging --warmups 10 --log-level debug --log-requests --log-requests-level 3 --prefill-round-robin-balance --dist-backend mooncake

Router

python3 -m sglang_router.launch_router --host 0.0.0.0 --port 8000 --pd-disaggregation --prefill-policy cache_aware --decode-policy round_robin --service-discovery --service-discovery-port 30300 --prefill-selector role=leader component=prefill --decode-selector role=leader component=decode --service-discovery-namespace default --request-timeout-secs 600 --max-concurrent-requests 1024 --model-path /data/models/DeepSeek-R1-0528/

Sample curl request

curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/data/models/DeepSeek-R1-0528", "temperature": 0, "messages": [ {"role": "user", "content": "做一份北京出行攻略"}], "max_tokens": 300}'

{"id":"3f890c4a2b5445c0aabffdab05469787","object":"chat.completion","created":1757923786,"model":"/data/models/DeepSeek-R1-0528","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n嗯,用户想要一份北京出行攻略。这个需求很常见,但需要仔细考虑用户可能的背景和需求深度。用户没有说明旅行时长、预算、同行人员或兴趣偏好,所以攻略需要保持通用性,同时覆盖不同人群的可能需求。\n\n北京作为超大型旅游城市,攻略必须分层设计。首先考虑经典路线,这是大多数首次游客的核心需求。故宫、长城、天安门这些标志性景点必须包含,但要注意提醒用户提前预约——最近几年北京热门景点不预约根本进不去,很多外地游客会忽略这点。\n\n交通部分要特别强调地铁的便利性。北京地面交通太拥堵,地铁虽然拥挤但准时,尤其是1号线、2号线这些经典线路。不过也要提醒避开早晚高峰,不然用户体验会很差。机场快轨和北京南站到市区的方式也得写清楚,这是游客刚落地最需要的实用信息。\n\n美食推荐要平衡名气和本地特色。全聚德虽然被说商业化,但第一次来的人总想尝尝;同时必须加入尹三豆汁、门框卤煮这种地道小吃,但得标注清楚“豆汁慎点”——很多人喝不惯。四季民福烤鸭比全聚德性价比高,这个提示对预算有限的游客很实用。\n\n考虑到不同游客类型,补充了亲子、文化等主题路线。带孩子的家庭肯定需要环球影城和科技馆信息;文艺青年可能更喜欢798和胡同。还想到最近流行的","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":8,"total_tokens":308,"completion_tokens":300,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

Next Steps

This PR establishes the foundation by introducing the Mooncake Backend and Mooncake EP. The following steps will build upon it:

  • Integrate with EPLB: Adapt the EPLB algorithm to avoid mapping experts to faulty ranks, using Mooncake’s fault-awareness.
  • Expert Redistribution: Trigger expert rebalancing when new rank failures occur, ensuring that every expert has at least one healthy replica.
  • Scheduler Enhancements: Make the scheduler fault-aware, so failed requests can be retried or rerouted to healthy ranks automatically.
  • Stability and Performance Testing: Broaden testing coverage with larger TP sizes, prefill-decode disaggregation, and multiple-rank failure scenarios.

These will be addressed in subsequent PRs as outlined in the milestone roadmap in #8961 (comment).

Checklist

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @UNIDY2002, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the foundational groundwork for fault-tolerant distributed Mixture-of-Experts (MoE) inference by integrating the new Mooncake communication backend. It introduces Mooncake as a drop-in replacement for torch.distributed primitives and as an alternative all-to-all backend for MoE layers, designed to provide fault tolerance. While current benchmarks indicate a slight performance decrease, this is expected as optimization work is planned for future iterations. This PR is a crucial first step towards building a more robust and resilient distributed inference system.

Highlights

  • New Communication Backend: Introduced the Mooncake Backend to provide fault tolerance for torch.distributed primitives, offering an alternative to NCCL, Gloo, etc.
  • Mooncake Expert Parallelism (EP): Implemented Mooncake EP as a new all-to-all backend for Mixture-of-Experts (MoE) layers, replacing the existing DeepEP backend with a fault-tolerant solution.
  • Non-Intrusive Integration: The Mooncake Backend integrates seamlessly with torch.distributed.init_process_group by registering 'mooncake' as a custom backend, requiring minimal code changes.
  • New Dispatcher Implementation: A new MooncakeEPDispatcher class has been added, providing the core logic for Mooncake EP's dispatch and combine operations, including handling of broken ranks.
  • Expanded Configuration Options: New command-line arguments --dist-backend mooncake and --moe-a2a-backend mooncake are available to enable the new fault-tolerant features.
  • New Unit Tests: A comprehensive new unit test file (test_mooncake_ep_small.py) has been added to validate the correctness of the Mooncake Backend and Mooncake EP across various distributed configurations (Pure DP, Hybrid DPTP, TP, TBO).

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the Mooncake backend and Mooncake EP to support fault-tolerant distributed inference. The changes are well-structured and integrate the new backend into the existing system. I've added a few comments to improve code maintainability and user experience, such as refining error messages, renaming confusing variables, and suggesting refactoring for duplicated code. The addition of new tests for Mooncake EP is also a great step towards ensuring correctness.

Comment on lines +674 to +678
if self.moe_a2a_backend == "mooncake":
self.ep_size = self.tp_size
logger.warning(
f"Mooncake MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size[{self.tp_size}]."
)
@gemini-code-assist (bot) commented (severity: medium):

This block is very similar to the one for deepep on lines 665-669. Consider refactoring them into a single block to avoid code duplication, for example: if self.moe_a2a_backend in ["deepep", "mooncake"]:.
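
A sketch of the suggested consolidation, assuming the deepep branch on lines 665-669 has the same shape as the block quoted above (the combined warning message is illustrative):

if self.moe_a2a_backend in ("deepep", "mooncake"):
    self.ep_size = self.tp_size
    logger.warning(
        f"{self.moe_a2a_backend} MoE is enabled. The expert parallel size is "
        f"adjusted to match the tensor parallel size [{self.tp_size}]."
    )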

@whybeyoung (Collaborator):

LGTM

Comment on lines +68 to +70
def get_ep_active_ranks() -> torch.Tensor:
assert _ACTIVE_RANKS is not None, "_ACTIVE_RANKS is not initialized"
return _ACTIVE_RANKS
@ShangmingCai (Collaborator) commented Sep 22, 2025:

I don't see any place that uses the get_ep_active_ranks util, and _ACTIVE_RANKS is never updated after initialization (L156-L159), so why not just use self.active_ranks? Is this left for future usage?

@UNIDY2002 (Author) replied:

Yes. In the following PRs, _ACTIVE_RANKS will be used by the model_runner to know whether a rank has been broken and determine whether to trigger a redistribution of expert weights.
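
A hypothetical consumer, just to illustrate the intended direction; the helper redistribute_experts and the flag layout of the tensor are assumptions, not code from this PR:

active_ranks = get_ep_active_ranks()  # one flag per EP rank; 0/False marks a broken rank (assumed layout)
broken = (active_ranks == 0).nonzero().flatten().tolist()
if broken:
    # A future model_runner could trigger expert-weight redistribution here so
    # that every expert keeps at least one healthy replica.
    redistribute_experts(exclude_ranks=broken)  # hypothetical helper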
