
Conversation


@yuan-luo yuan-luo commented Sep 17, 2025

Motivation

This PR adds support for PyTorch Symm Mem AllReduce, which handles large message sizes.

  • Delivers performance gains for scenarios with msg_size > 512 KB.
  • Currently, MSCCLPP only supports a maximum message size of 1 MB; messages larger than 1 MB error out.
  • Symm Mem AllReduce supports messages ≥ 2 MB.
  • This PR lays the groundwork for long-context support and enables large-message AllReduce (AR).

Torch Symm Mem AllReduce is based on NVLS:
  • On Blackwell, NVLS is always deterministic.
  • On Hopper, NVLS is deterministic with CUDA 12.8+ or an updated CUDA 12.4.
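For context, here is a minimal sketch of how the torch symmetric-memory all-reduce path can be invoked. It assumes PyTorch ≥ 2.5 with the private torch.distributed._symmetric_memory API and an NVLS-capable multi-GPU node; it is illustrative only, not this PR's SymmMemCommunicator.

```python
# Minimal sketch (assumption: PyTorch >= 2.5, NVLS-capable GPUs, launched via
# torchrun --nproc-per-node=<N>). Not the PR's actual SymmMemCommunicator.
import os
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

def main():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    group_name = dist.group.WORLD.group_name

    # Symmetric buffer: same size/dtype on every rank, registered for NVLS.
    buf = symm_mem.empty(1 << 20, dtype=torch.bfloat16, device="cuda")  # 2 MiB of bf16
    symm_mem.rendezvous(buf, group_name)

    buf.fill_(local_rank)  # stand-in for the tensor to reduce (normally copied in)
    # NVLS multimem all-reduce performed in place on the symmetric buffer.
    torch.ops.symm_mem.multimem_all_reduce_(buf, "sum", group_name)
    torch.cuda.synchronize()
    if local_rank == 0:
        print(buf[:4])

if __name__ == "__main__":
    main()
```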

SYMM_MEM:
| msg_size   |   torch eager time |   symm mem eager time |   symm mem graph time |   pynccl graph time |
|------------|--------------------|-----------------------|-----------------------|---------------------|
| 2.0 KiB    |            48.6752 |               42.6112 |               8.95296 |             16.4637 |
| 4.0 KiB    |            26.304  |               27.6096 |               8.99104 |             18.583  |
| 8.0 KiB    |            21.9744 |               27.3632 |               9.21984 |             18.6733 |
| 16.0 KiB   |            20.3968 |               27.1584 |               9.3568  |             19.1792 |
| 32.0 KiB   |            20.2848 |               34.1888 |               9.40416 |             20.8093 |
| 64.0 KiB   |            20.5088 |               26.5888 |               9.73728 |             20.968  |
| 128.0 KiB  |           110.179  |               26.8448 |              10.0224  |             21.5664 |
| 256.0 KiB  |            20.9184 |               29.7504 |              10.6275  |             21.7779 |
| 512.0 KiB  |            23.5904 |               27.5392 |              12.2198  |             21.9856 |
| 1.0 MiB    |            28.1664 |               29.1744 |              14.2813  |             27.6038 |
|------------|--------------------|-----------------------|-----------------------|---------------------|
| 2.0 MiB    |            39.2576 |               29.1744 |              19.0234  |             39.5312 |  <<<<<< MSCCLPP does not support messages from this size onward
| 4.0 MiB    |            52.5408 |               33.4528 |              28.1082  |             52.8202 |
| 8.0 MiB    |            78.064  |               74.0832 |              46.6918  |             86.2893 |
| 16.0 MiB   |           118.87   |               93.2256 |              88.1763  |            119.501  |

MSCCLPP:
| msg_size   |   torch eager time |   msccl eager time |   msccl graph time |   pynccl graph time |
|------------|--------------------|--------------------|--------------------|---------------------|
| 2.0 KiB    |            67.6608 |            19.7952 |            7.17696 |             18.5763 |
| 4.0 KiB    |            22.7936 |            19.248  |            6.94688 |             18.6624 |
| 8.0 KiB    |            20.5888 |            19.4688 |            7.32512 |             19.0259 |
| 16.0 KiB   |            21.4752 |            20.9152 |            7.58144 |             19.3904 |
| 32.0 KiB   |            21.4208 |            21.1424 |            7.55168 |             20.9411 |
| 64.0 KiB   |            21.84   |            19.8304 |            8.00928 |             21.1987 |
| 128.0 KiB  |            20.9728 |            20.72   |            8.91872 |             21.479  |
| 256.0 KiB  |            24.2208 |            20.2752 |            9.8064  |             21.7882 |
| 512.0 KiB  |            24.4448 |            21.9616 |           11.9722  |             22.0272 |
| 1.0 MiB    |            26.8448 |            22.1696 |           18.2694  |             27.7914 |
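For reference, the eager-mode columns in the tables above correspond to a straightforward latency sweep over message sizes. The following is a hedged sketch (illustrative only, not the benchmark script added in this PR) that times torch.distributed.all_reduce per message size with CUDA events:

```python
# Hedged sketch of an eager all-reduce latency sweep; sizes, dtype, and iteration
# counts are illustrative. Assumes a process group is already initialized.
import torch
import torch.distributed as dist

def bench_all_reduce(sizes_kib, dtype=torch.bfloat16, iters=100, warmup=10):
    results = {}
    for kib in sizes_kib:
        numel = kib * 1024 // dtype.itemsize
        x = torch.randn(numel, dtype=dtype, device="cuda")
        for _ in range(warmup):          # warm up NCCL / allocator
            dist.all_reduce(x)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            dist.all_reduce(x)
        end.record()
        torch.cuda.synchronize()
        # elapsed_time() returns milliseconds; report microseconds per call.
        results[kib] = start.elapsed_time(end) * 1000.0 / iters
    return results
```

Graph-mode numbers would come from capturing the same call into a CUDA graph and timing its replay instead of the eager launch.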

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates PyTorch Symmetric Memory AllReduce into the system, providing a more robust and performant solution for distributed communication, especially for large data transfers. By extending support beyond the limitations of existing AllReduce mechanisms, it enhances the system's capability to handle demanding workloads, such as those found in long-context language models, ensuring better scalability and efficiency in distributed environments.

Highlights

  • PyTorch Symmetric Memory AllReduce Support: This pull request introduces support for PyTorch Symmetric Memory AllReduce, a new communication primitive designed to handle larger message sizes more efficiently than existing methods like MSCCLPP.
  • Performance Gains for Large Messages: The new Symmetric Memory AllReduce delivers significant performance improvements for message sizes greater than 512KB, particularly for messages 2MB and larger, where MSCCLPP previously hit its limits.
  • Groundwork for Long-Context Models: This feature lays crucial groundwork for supporting long-context models and enabling large-message AllReduce operations, which are essential for scaling such models.
  • Comprehensive Benchmarking and Testing: New benchmark scripts and unit tests have been added to validate the performance and correctness of the Symmetric Memory AllReduce implementation across various message sizes and data types.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for PyTorch Symmetric Memory AllReduce, aiming to improve performance for large message sizes. The changes include a new SymmMemCommunicator, integration into the distributed parallel state management, and new benchmarks and tests.

My review has identified a critical bug in parallel_state.py due to a variable name typo. Additionally, I've found several areas for improvement across the new files, including removing dead code and unused imports, clarifying magic numbers with comments, and making the implementation more flexible by avoiding hardcoded data types. The new tests could also be more robust to ensure the new feature is exercised correctly across different configurations. Overall, the changes are well-structured, but addressing these points will enhance the code's correctness and maintainability.

@trevor-m

Hi @yuan-luo #8238 enabled NCCL symmetric memory all-reduce with --enable-symm-mem. It avoids a copy into the symm memory buffer by using a custom memory allocation pool.
That approach used pynccl because native torch support for symmetric memory wasn't added yet.


yuan-luo commented Sep 18, 2025

> Hi @yuan-luo #8238 enabled NCCL symmetric memory all-reduce with --enable-symm-mem. It avoids a copy into the symm memory buffer by using a custom memory allocation pool. That approach used pynccl because native torch support for symmetric memory wasn't added yet.

@trevor-m I'm aware of that PR. These are two different mechanisms: one uses NCCL and the other uses the Torch mechanism. I'll add a new feature flag to distinguish them; see the sketch below.
Per the test results, Torch Symm Mem provides a more robust and higher-performance AllReduce at large message sizes.
Btw, vLLM has made Torch Symmetric Memory All-Reduce its default All-Reduce, replacing the previous custom All-Reduce.
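For illustration, a hedged sketch of what such flag-gated dispatch could look like (the helper name should_use_symm_mem is hypothetical; the actual wiring lives in the communicator setup touched by this PR):

```python
# Hypothetical sketch of flag-gated communicator selection; not the exact code
# in this PR. symm_mem_comm / pynccl_comm stand in for the configured communicators.
def tensor_model_parallel_all_reduce(inp, *, enable_torch_symm_mem,
                                     symm_mem_comm, pynccl_comm):
    # Prefer the torch symmetric-memory path when the flag is on and the
    # message is eligible (NVLS-capable hardware, fits the registered buffer).
    if (enable_torch_symm_mem and symm_mem_comm is not None
            and symm_mem_comm.should_use_symm_mem(inp)):  # hypothetical eligibility check
        return symm_mem_comm.all_reduce(inp)
    # Otherwise fall back to the existing pynccl / custom all-reduce path.
    return pynccl_comm.all_reduce(inp)
```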

@yuan-luo yuan-luo changed the title Support PyTorch Symm Mem AllReduce Support Torch Symm Mem AllReduce Sep 18, 2025
@yuan-luo yuan-luo changed the title Support Torch Symm Mem AllReduce [WIP] Support Torch Symm Mem AllReduce Sep 18, 2025

yuan-luo commented Sep 18, 2025

E2E test with the new server arg --enable-torch-symm-mem passed.

$python3 -m sglang.launch_server --model-path /home/admin/DeepSeek-R1 --host 0.0.0.0 --port 30000 --trust-remote-code --enable-cache-report --quantization fp8 --log-level info --max-running-requests 32 --mem-fraction-static 0.92 --chunked-prefill-size 16384 --context-length 65535 --attention-backend flashinfer --disable-radix-cache --page-size 64 --tp-size 8 --enable-metrics --cuda-graph-max-bs 32 --enable-torch-symm-mem
......

[2025-09-18 17:13:48] INFO:     Started server process [80415]
[2025-09-18 17:13:48] INFO:     Waiting for application startup.
[2025-09-18 17:13:48] INFO:     Application startup complete.
[2025-09-18 17:13:48] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-09-18 17:13:49] INFO:     127.0.0.1:46614 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-18 17:13:49 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-09-18 17:13:49 TP0] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run `sglang.compile_deep_gemm`. It is recommended to run `sglang.compile_deep_gemm` with same args as `sglang.launch_server` for pre-compilation to reduce the overhead if you have not run it before. For example: `python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`
[2025-09-18 17:13:49 TP0] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=4096, K=512, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run `python3 -m sglang.compile_deep_gemm`. 
DeepGEMM warmup: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32768/32768 [00:06<00:00, 5397.02it/s]
[2025-09-18 17:13:58] INFO:     127.0.0.1:46618 - "POST /generate HTTP/1.1" 200 OK
[2025-09-18 17:13:58] The server is fired up and ready to roll!
[2025-09-18 17:14:19 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-09-18 17:14:19 TP0] Decode batch. #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.26, #queue-req: 0, 
[2025-09-18 17:14:20 TP0] Decode batch. #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.71, #queue-req: 0, 
[2025-09-18 17:14:20 TP0] Decode batch. #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.76, #queue-req: 0, 
[2025-09-18 17:14:21 TP0] Decode batch. #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.95, #queue-req: 0, 
[2025-09-18 17:14:21 TP0] Decode batch. #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 84.86, #queue-req: 0, 
[2025-09-18 17:14:21] INFO:     127.0.0.1:39634 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client:

$cat test_openai.py
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals. Tell me how you rank them"},
    ],
    temperature=0,
    max_tokens=200,
)
print(response)

$python test_openai.py 
ChatCompletion(id='b060b6252b784549b14f4268ba7a3c47', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="Okay, the user wants me to list three countries and their capitals and then rank them. Let me start by picking three countries. Maybe I should choose well-known ones to keep it simple. Let's see, France with Paris, Japan with Tokyo, and Brazil with Brasília. That covers different continents.\n\nNow, how to rank them. The user didn't specify the criteria, so I need to decide on a basis. Population? Economic size? Area? Or maybe cultural significance. Since capitals are involved, maybe the population of the capital cities? Or perhaps the historical importance of the capitals. Alternatively, alphabetical order, but that's not really a ranking. \n\nWait, the user might want a subjective ranking based on some interesting factors. Let me think. If I rank them by the population of the capitals, Tokyo is the largest, then Paris, then Brasília. But maybe the user wants a more engaging reason. For example, cultural influence, tourism, or something like that. \n\n", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1758186861, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=25, total_tokens=225, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

@yuan-luo yuan-luo changed the title [WIP] Support Torch Symm Mem AllReduce [Feat] Support Torch Symm Mem AllReduce Sep 18, 2025