[Feat] Support Torch Symm Mem AllReduce #10571
Conversation
Summary of Changes
Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates PyTorch Symmetric Memory AllReduce into the system, providing a more robust and performant solution for distributed communication, especially for large data transfers. By extending support beyond the limitations of existing AllReduce mechanisms, it enhances the system's capability to handle demanding workloads, such as those found in long-context language models, ensuring better scalability and efficiency in distributed environments.
Highlights
- PyTorch Symmetric Memory AllReduce Support: This pull request introduces support for PyTorch Symmetric Memory AllReduce, a new communication primitive designed to handle larger message sizes more efficiently than existing methods like MSCCLPP (a minimal sketch of the underlying primitive follows this list).
- Performance Gains for Large Messages: The new Symmetric Memory AllReduce delivers significant performance improvements for message sizes greater than 512KB, particularly for messages 2MB and larger, where MSCCLPP previously hit its limits.
- Groundwork for Long-Context Models: This feature lays crucial groundwork for supporting long-context models and enabling large-message AllReduce operations, which are essential for scaling such models.
- Comprehensive Benchmarking and Testing: New benchmark scripts and unit tests have been added to validate the performance and correctness of the Symmetric Memory AllReduce implementation across various message sizes and data types.
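For orientation, here is a minimal sketch of the PyTorch symmetric-memory primitive this work builds on (not the PR's SymmMemCommunicator itself). torch.distributed._symmetric_memory is an experimental API whose signatures may shift between PyTorch versions, so treat this as illustrative; run it under torchrun with one NVLS-capable GPU per rank.

```python
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
group_name = dist.group.WORLD.group_name

# Allocate from the symmetric heap and exchange peer handles once up front.
buf = symm_mem.empty(4096, dtype=torch.bfloat16, device=f"cuda:{rank}")
symm_mem.rendezvous(buf, group_name)

buf.fill_(float(rank + 1))
# In-place multimem (NVLS) all-reduce over the symmetric buffer.
torch.ops.symm_mem.multimem_all_reduce_(buf, "sum", group_name)

torch.cuda.synchronize()
dist.destroy_process_group()
```

A real communicator copies user tensors into a pre-registered buffer like this one, since only symmetric allocations can participate in the reduction.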
Code Review
This pull request introduces support for PyTorch Symmetric Memory AllReduce, aiming to improve performance for large message sizes. The changes include a new SymmMemCommunicator, integration into the distributed parallel state management, and new benchmarks and tests.
My review has identified a critical bug in parallel_state.py due to a variable name typo. Additionally, I've found several areas for improvement across the new files, including removing dead code and unused imports, clarifying magic numbers with comments, and making the implementation more flexible by avoiding hardcoded data types. The new tests could also be more robust to ensure the new feature is exercised correctly across different configurations. Overall, the changes are well-structured, but addressing these points will enhance the code's correctness and maintainability.
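To make the point about hardcoded data types concrete, here is a hedged sketch of a size- and dtype-gated dispatch. The class name mirrors the PR's SymmMemCommunicator, but the method name, thresholds, and supported-dtype set are illustrative assumptions rather than the PR's actual interface:

```python
import torch

class SymmMemCommunicator:
    # Illustrative values: below ~512KB other backends tend to win, and only
    # a whitelist of element types is accepted (not one hardcoded dtype).
    _MIN_MSG_BYTES = 512 * 1024
    _SUPPORTED_DTYPES = (torch.float16, torch.bfloat16, torch.float32)

    def should_use_symm_mem(self, inp: torch.Tensor) -> bool:
        if inp.dtype not in self._SUPPORTED_DTYPES:
            return False
        return inp.numel() * inp.element_size() >= self._MIN_MSG_BYTES
```

parallel_state would then consult such a check first and fall back to the existing custom or NCCL all-reduce when it returns False.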
@trevor-m I'm aware of that PR. These are two different mechanisms: one uses NCCL and the other uses the Torch mechanism. I'll add a new feature flag to distinguish them.
New server arg --enable-torch-symm-mem added; the client-side E2E test passed.
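For reference, the flag goes on the server launch command, e.g. `python -m sglang.launch_server --model-path <model> --tp 8 --enable-torch-symm-mem`; the model path and TP size here are placeholders, and only the flag name itself comes from this PR.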
Motivation
This PR adds support for PyTorch Symmetric Memory (Symm Mem) AllReduce, which handles large message sizes.
It delivers performance gains for scenarios with msg_size > 512KB.
Currently, MSCCLPP only supports a maximum message size of 1MB; messages larger than 1MB will error out.
Symm Mem AllReduce supports messages of 2MB and larger.
This PR lays the groundwork for long-context support and enables large-message AllReduce (AR).
Torch Symm Mem AllReduce is based on NVLS (NVLink SHARP):
On Blackwell, NVLS is always deterministic.
On Hopper, NVLS is deterministic with CUDA 12.8+ or an updated CUDA 12.4.
A sketch of a size-sweep correctness check for this path follows.
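As a companion to the size claims above, here is a minimal sketch of that size-sweep correctness check, comparing the symmetric-memory path against a plain NCCL all-reduce at sizes straddling the 512KB and 2MB boundaries. It uses PyTorch's experimental torch.distributed._symmetric_memory API directly rather than the PR's actual test harness, so the details are assumptions; run under torchrun with one NVLS-capable GPU per rank.

```python
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

def check_symm_mem_allreduce(num_elems: int, dtype=torch.bfloat16) -> None:
    rank = dist.get_rank()
    group_name = dist.group.WORLD.group_name
    ref = torch.randn(num_elems, dtype=dtype, device=f"cuda:{rank}")

    # Symmetric-memory path: reduce in place over a registered buffer.
    buf = symm_mem.empty(num_elems, dtype=dtype, device=f"cuda:{rank}")
    symm_mem.rendezvous(buf, group_name)
    buf.copy_(ref)
    torch.ops.symm_mem.multimem_all_reduce_(buf, "sum", group_name)

    # NCCL reference path.
    expected = ref.clone()
    dist.all_reduce(expected)
    torch.testing.assert_close(buf, expected, rtol=2e-2, atol=2e-2)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    # 512KB, 2MB, and 8MB messages; bf16 is 2 bytes per element.
    for nbytes in (512 * 1024, 2 * 1024 * 1024, 8 * 1024 * 1024):
        check_symm_mem_allreduce(nbytes // 2)
    dist.destroy_process_group()
```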
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist