[1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP #10423
Summary of Changes
Hello @UNIDY2002, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request lays the foundational groundwork for fault-tolerant distributed Mixture-of-Experts (MoE) inference by integrating the new Mooncake communication backend. It introduces Mooncake as a drop-in replacement for `torch.distributed` primitives and as an alternative all-to-all backend for MoE layers, designed to provide fault tolerance. While current benchmarks indicate a slight performance decrease, this is expected, as optimization work is planned for future iterations. This PR is a crucial first step towards building a more robust and resilient distributed inference system.
Highlights
- New Communication Backend: Introduced the Mooncake Backend to provide fault tolerance for `torch.distributed` primitives, offering an alternative to NCCL, Gloo, etc.
- Mooncake Expert Parallelism (EP): Implemented Mooncake EP as a new all-to-all backend for Mixture-of-Experts (MoE) layers, replacing the existing DeepEP backend with a fault-tolerant solution.
- Non-Intrusive Integration: The Mooncake Backend integrates seamlessly with `torch.distributed.init_process_group` by registering "mooncake" as a custom backend, requiring minimal code changes.
- New Dispatcher Implementation: A new `MooncakeEPDispatcher` class has been added, providing the core logic for Mooncake EP's dispatch and combine operations, including handling of broken ranks.
- Expanded Configuration Options: New command-line arguments `--dist-backend mooncake` and `--moe-a2a-backend mooncake` are available to enable the new fault-tolerant features.
- New Unit Tests: A comprehensive new unit test file (`test_mooncake_ep_small.py`) has been added to validate the correctness of the Mooncake Backend and Mooncake EP across various distributed configurations (Pure DP, Hybrid DP+TP, TP, TBO).
Code Review
This pull request introduces the Mooncake backend and Mooncake EP to support fault-tolerant distributed inference. The changes are well-structured and integrate the new backend into the existing system. I've added a few comments to improve code maintainability and user experience, such as refining error messages, renaming confusing variables, and suggesting refactoring for duplicated code. The addition of new tests for Mooncake EP is also a great step towards ensuring correctness.
```python
if self.moe_a2a_backend == "mooncake":
    self.ep_size = self.tp_size
    logger.warning(
        f"Mooncake MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size [{self.tp_size}]."
    )
```
LGTM
```python
def get_ep_active_ranks() -> torch.Tensor:
    assert _ACTIVE_RANKS is not None, "_ACTIVE_RANKS is not initialized"
    return _ACTIVE_RANKS
```
I don't see any place that uses this `get_ep_active_ranks` util, and `_ACTIVE_RANKS` hasn't been updated after initialization (L156-L159), so why not just use `self.active_ranks`? Is this left for future usage?
Yes. In the following PRs, `_ACTIVE_RANKS` will be used by the model_runner to know whether a rank has failed and to determine whether to trigger a redistribution of expert weights.
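To illustrate that intended follow-up, here is a minimal sketch (not part of this PR) of how a model runner could compare snapshots of the active-ranks tensor to detect newly broken ranks; the 0/1 mask layout of `_ACTIVE_RANKS` and the helper name are assumptions for illustration only.

```python
import torch

# Hypothetical follow-up usage, assuming _ACTIVE_RANKS is a 0/1 mask indexed
# by EP rank (the real layout is defined by the Mooncake backend).
def detect_newly_broken_ranks(prev_active: torch.Tensor,
                              curr_active: torch.Tensor) -> list[int]:
    """Return EP ranks that were active in the previous snapshot but not now."""
    broken = (prev_active.bool() & ~curr_active.bool()).nonzero(as_tuple=False)
    return broken.flatten().tolist()

# Example: rank 2 fails between two checks of get_ep_active_ranks().
prev = torch.tensor([1, 1, 1, 1])
curr = torch.tensor([1, 1, 0, 1])
if detect_newly_broken_ranks(prev, curr):
    pass  # here the model runner would trigger expert-weight redistribution
```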
Motivation
Why do we need a new backend?
The Mooncake Backend is a collective communication backend that provides fault tolerance while remaining seamlessly compatible with PyTorch. To the best of our knowledge, this is the first backend that supports fault tolerance for both:
- `torch.distributed` primitives, and
- the all-to-all communication used by MoE expert parallelism.

This lays the foundation for fault-tolerant distributed MoE inference (#8961).
Modifications
The Mooncake Backend and Mooncake EP can be enabled via two options:

Mooncake Backend
- Enabled via `--dist-backend mooncake`.
- Replaces `torch.distributed` with Mooncake.
- Registered `"mooncake"` as a custom backend in `torch.distributed.init_process_group` (a minimal usage sketch follows below).
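As a minimal sketch of the non-intrusive integration, process-group creation would only need the backend string changed; the `mooncake` import below is an assumed entry point for the wheel and may differ from the actual package layout.

```python
import torch
import torch.distributed as dist

# Assumption: importing the Mooncake package registers "mooncake" as a custom
# backend (e.g. via torch.distributed.Backend.register_backend under the hood).
import mooncake  # hypothetical import name

# Everything else stays standard torch.distributed usage.
dist.init_process_group(
    backend="mooncake",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

# Collectives are then issued through the registered backend as usual.
t = torch.ones(4)
dist.all_reduce(t)
dist.destroy_process_group()
```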
Mooncake EP
- Enabled via `--moe-a2a-backend mooncake`.
- Implemented `MooncakeEPDispatcher` in `token_dispatcher/mooncake.py` (see the conceptual sketch below).
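For readers unfamiliar with EP all-to-all, the sketch below shows the generic dispatch step that such a dispatcher abstracts, expressed with plain `torch.distributed.all_to_all_single`. It is a conceptual illustration, not the `MooncakeEPDispatcher` API; the shapes and split handling are assumptions.

```python
import torch
import torch.distributed as dist

# Conceptual EP dispatch: send each rank's routed tokens to the EP ranks that
# own the target experts. MooncakeEPDispatcher additionally has to tolerate
# broken ranks, which plain all_to_all_single does not.
def ep_dispatch(local_tokens: torch.Tensor, send_splits: list[int],
                group=None) -> torch.Tensor:
    world_size = dist.get_world_size(group)

    # Exchange split sizes so every rank knows how many tokens it will receive.
    in_splits = torch.tensor(send_splits, dtype=torch.long)
    out_splits = torch.empty(world_size, dtype=torch.long)
    dist.all_to_all_single(out_splits, in_splits, group=group)

    # Exchange the token payloads according to the negotiated splits.
    recv_tokens = torch.empty(int(out_splits.sum()), local_tokens.shape[-1],
                              dtype=local_tokens.dtype,
                              device=local_tokens.device)
    dist.all_to_all_single(recv_tokens, local_tokens,
                           output_split_sizes=out_splits.tolist(),
                           input_split_sizes=send_splits, group=group)
    return recv_tokens
```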
Accuracy Tests
- Added `ep/test_mooncake_ep_small.py` to validate the correctness of the new features.

Caveat:
The Mooncake package with the new backend support will be published soon.
For now, you may download the wheel named with "+ep" from the CI artifacts of kvcache-ai/Mooncake#805.
Benchmarking and Profiling
According to our unit tests, enabling the Mooncake Backend and Mooncake EP may slightly decrease performance. This is expected, since the Mooncake Backend is not yet fully optimized; optimization work will be carried out separately.
We compared `test_mooncake_ep_small.py` with `test_deepep_small.py`.

Throughput (token/s):
Latency (s):
PD Disaggregation Use
We conducted an end-to-end test of this PR in a PD disaggregation scenario on two H20-3e nodes:
- prefill
- decode
- router
- simple curl
Next Steps
This PR establishes the foundation by introducing the Mooncake Backend and Mooncake EP; subsequent work will build upon it.
These will be addressed in subsequent PRs as outlined in the milestone roadmap in #8961 (comment).
Checklist